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ABSTBACT 



One techniqu* that i$ Kwetlir^ empk^ to enhance th« performance of a d»taba»« 
management system Is known as alU'lbute pai t h l Wmf . This is the process of dividing the 
attributes of a Afc into subfiles that are ttoi^ separately If storing together those attributes 
that are freouefttty requested together by transaction*, and by separating those that are not, 
attribute partitioning can reduce the number of pages that must be transferred from 
secondary storage to primary memory w order fty pi msn a transaction. 

The goal of this work U to design m e chanism! mat can automatlcaJly select a 
near-optimal attribute partition of a file's attributes, based on the usage pattern of the file 
and on the characteristics of the data m the We. Tht approach taken to this problem is 
based on the use of a fife design cost estimator and of heuristics to guide a search through 
the targe space of possible partitions. The heuristics propose a smalt set of promising 
partitions to submit for detailed analysis. The estimator assigns a figure of merit to any 
proposed partition that reflects the cost that would be incurred tot processing the transactions 
in the usage pattern if the We were partitioned m die proposed way. We have also 
conducted an extensive series of experiments with a variety of design heuristics; as a result, 
we have identified a heuristic that nearly always finds the optimal partition of a file. 

The context of tW* study is a relational database management system that can 
process transactions made against retattoro wheat physical partitioning ts unknown to the 
user, tn specifying and modeftrig this system, it is necessary to address the problem of 
optimizing nonprocedural a ueriBl made to a partitioned fmt We have derived a number of 
such optimization techniques and have provided the resufts of a number of experiments with 
them. 
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CHAPTER 1 
INTRODUCTION 



The work to be reported in this report is part of an ongoing research effort to 
develop a self-adaptive database management system. The intent of this development is 
twofold: to develop the techniques and met h o dol ogy for the construction of such systems, and 
to identify database physical design issues with techniques for their automatic and optimal 
determination. In this report, we address the problem of optimizing the performance of a 
self-adaptive database management system, in a dynamic environment where access 
requirements are continually changing, by automatically partitioning the attributes (fields) of 
the file. Attribute partitioning U the task of dividing the attributes (fields) of a file into 
non-overlapping groups and men storing each group in a separate physical file. 

L Belf-acUptivg Datubaae Manayomemt Systems 

It is important that a database system petform efficiently at alt times. Efficient 
performance requires that the physical organisation of the database match the usage pattern 
of its Users. Thus, as the database's usage pattern changes over time, its organization and its 
access structures can become obsolete, with 'Consequent degradation of performance. 
Performance degradation may also result as the database gnaws in size or as the nature of the 
data it contains changes. After some time, the performance of the database system may 
deteriorate sufficiently so as to compel a database reorganisation. Since the applications 
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programs accessing the database art contmuajh; being altered wtth new applications 
programs replacing ok* earn and rime the c e nt en t i ef the- database continually undergoes 
change, the reofgeiWHiiMe; of ttu>datijbeeiFt a^y tint l^M^Murai nwM <ae an ongoiog proeese, . 

Conventionally, the database ad ministrier de strminii when and how to reorganize 
a database. His dejJMen- w based en limned inJ e rma tion about the database and the 
transactions performed en fe eanjejpentfe Mi decision hi largely an intuitive guess. For large 
databases, a more systomaue mean s of acouiring inwrma l ion about database usage and a 
more algorithmic way of ev al uating Hit; emit of alternative cenflfwa^ons u essential. It has 
recently been proposed: thai database managem e nt systems be sotf^adepuve, end automatically 
reorganise themselves as the need arises [36, 181 Hammer [»] dtacuiaat a methodology for 
monitoring database usage patt e r n, and :4*mt$m 'the design principles for a self-adaptive, 
self-reorganizing database management system. 

A minimal capability of a self-adaptive database management system should be the 
incorporation of a monitoring mechanism that collects usage statistics while performing 
transaction processing. A database management system is weH suited for gathering and 
analyzing information on its own usage and perfor m an c e, and if the gathering ami analysis 
of the usage and perfor mance information is done a ppropriat e l y, the associated overhead can 
be minimal. In addition, a »aif*depttv« database manag e m e nt system shook! be able to come 
up with desirable physical ej^ajutattoiM ;(fc*v desirable data structures and access structures) 
based upon the collected statistics, and be aWe to evaluate the tost of each alternative 
organization in order to select an optimal physical organization for a database. Also, it is 
possible that such a systee* ought perform the necessary database reorg attfaation itself, after it 
has evaluated the cost/benefits of reorganization and the associated costs of retranslating the 
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applications programs that access the database. 

^ The Relational MocW for Pfttabaf+si 

In a self-adaptive database management system, the physical organization of the 
database is perpetually being reorganized. In order for the database reorganization to be 
truly effective, a database management system that performs self-reorganization will have to 
manifest the following two important characteristics: 1- data independence between the 
database's physical organization and the application programs that access the database, and 
2- nonprocedural access of the contents of the database, fty data independence we mean that 
users and their application programs are hot required to know the actual physical 
organization used to represent the data, so that they are free to concentrate on a logical view 
of the data. Data independence makes the database easy to use and avoids the need for 
application program retranstation every time. the database's physical structure is changed. 
Nonprocedural access also makes the database easy to use, this entails the provision of access 
languages which allow the specification of desired data in terms of properties it possesses 
rather than in terms of the search algorithm used to locate and retrieve it 

The relational model of data (Codd [9]) has been proposed as a means of achieving 
the above goals of data independence and nonprocedural access. The relational data model 
provides a simple and uniform logical view of the data that is completely independent of the 
actual storage structures and access structure used to represent and access the data- This 
makes the definition and manipulation of a database independent of its underlying physical 
organization. As a result, changes at the data organization level need not be reflected in the 
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programs that access the database. 

A relation in the retaMonaldata model is a mmed two dirnentional table that has a 
fixed number of attributes (columns) and aft arbitrary number of unordered tuples (rows). 
All the rows of the table have to be unique. A tue4e lepi easi iUwg an entry in a relation 
contains a value for each attribute of the relation. The number of attributes in a relation is 
m and the number of tuples in the relation is n. Figure I shows the relation 
ENROLLMENT for students enrolled in courses. The ENROLLMENT rotation has two 
attributes Student and Course, and four tuples (Doe, 6.035), (foe, 6.03?)* (Doe, 6.$51), and 
(Roe, 6.035) The physical realization of a relation Is often called a file, with the attributes 
and the tuples of the relation called the fields and records of jHie file respectively. Henceforth, 
we will use the term file when discussing the totality of the data in a relation, indicating that 
we are dealing with the physical representation of the relation. However, we will continue 



ENROLLMENT: 



Attribute 
Student Course 

i i i 

I Doe | &035 t 

I I I 
| Poe | 6.032 I 

Tuple || | 

| Doe | 6.851 I 

II \ 
| Roe | 6.035 | 

III 
Figure 1 The ENROLLMENT relation with 2 attributes and 4 tuples. 
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using the terms attribute and tuple for the fields and records of a file so that the two 
dimensional tabular data format of the relation will be kept In mind. 

In a relational database management system, the v user's view of the database is 
independent of the details of the database's physical organization. Furthermore, his 
nonprocedural queries are far removed from the primitive data manipulation operations for 
locating and retrieving the data. Consequently, more responsibility is placed on a relational 
database management. system than on a conventional system. This responsibility takes two 
forms: I- choosing an efficient physical organitation for the relation, and 2- optimizing the 
process of finding answers to queries made to the database, by the means of efficient and 
judicial use of the available access structures. 

We believe that the selection of a good phytkal o rg aniza ti on is the primary issue in 
relational database implementation, since the efficiency that can be achieved by a query 
optimizer is strictly delimited by the available access structures. Furthermore, the efficient 
utilization of a database is highly dependent on the optimal matching of its physical 
organization to- its access requirements, as well as tothe other database characteristics (such as 
the distribution of attribute values in it). Hence, the usage pattern of a database should be 
ascertained and utilized in choosing the physical organization. 

There are numerous possibilities for the physical organization of a relational 
database. The selection of a particular physical organization must be based on minimizing 
the performance cost in terms of both data access cost and data storage cost. The subject of 
this research Is selecting the optimal attribute partition of a relational database by utilizing 
the access pattern history of the database in order to minimize the data access cost. Attribute 
partitioning consists of dividing the attributes of a file into subfiles that are stored separately. 
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In relational terms; this means splitting a wtettOtt into m number Of iubrelations, each 
containing a subset of the attributes of the original retatton, fochf that tfct original relation 
may be uniquely reconstructed from the cottectkm of the su b re latio ns. (Strictly speaking, a 
subfile is not a relalte* in that ^^^m^'t^^ mmMmmd^ ! ^'mm'0mm-iamm\ definition 
of a subfile in the next section.} 

3. Attribute Partitioning fat tv Boiaiticmati DBM8 

Let A be the set of attributes of a relation, and let T be the set of tuple identifiers 
of the relation. (A tuple identifier is a unique identifier for a tuple in 4he relation.) The 
number of attributes hi the relation is (A) - m , and the tiutnber of tuples in the relation is 
|T| - n . Consider the coUectton of subfiles F - |F,J*f.i , «*ew each subfile F, is defined by a 
pair consisting of an attribute set and a tuple identifier set, which specifies the attributes and 
tuple identifiers that are represented in the subfile F f : (A|, T|) » %£ A , T f c T . The 
collection of subfiles F is called the clustering of the relation, and can have two bask: forms: 

1- an attribute cluster in which 

Tj - T i - 1, , M and 

M 

U A: - A , 

2- a tuple cluster in which 

Aj - A i - 1> •-• M and 

M 

u Tj-T. 

The tuples of a subfile are called subtuples. A subfile F, of an attribute cluster 
contains n subtuples, one for each tuple in the original relation. A subtuple of a subfile is 
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that part of the original file's tuple that has attributes A, . The subtuples of subfile F, in an 
attribute cluster need not all be different For example, if the relation of Figure 1 is clustered 
such that F| - (Ai , T) is a subfile with A f - [Coyr**), then the subtuples of F, will be 
(6.035), (6.032), (6.851), and (6.035). 

An attribute cluster {F,tf., in which A, n Aj - # forall i f j is termed an attribute 
partition of the relation. In this report we will limit our attention to the topic of attribute 
partitioning. (A discussion of tuple partitioning appears in Section 7.2.) Attribute 
partitioning is the task of dividing the attributes of a relation and storing each disjoint subset 
of attributes in a separate subfile. The objective of attribute partitioning is to construct an 
attribute partition of a relation that optimizes the per fo rm a n c e of the database management 
system by minimizing the cost of locating and retrieving data. Intuitively, attribute 
partitioning is accomplished by assigning attributes to the same subfile whenever they are 
consistently requested together by queries. 

In conventional database management systems (with paged memory organization), 
each tuple of a relation is stored with all its attributes together in one file. When a query is 
made to the database, all tuples that are required by the query are brought into primary 
memory by retrieving all the pages that the tuples reside on. It has been observed in 
practical database applications that a query does not usually request all the attributes of the 
file; most queries request only a few of the attributes. Problems- are presented by the 
co-existence in the same file (or equivalent^ in the same tuple) of attributes that are not 
requested by the query together with the few attributes that are. Whenever the requested 
attributes are retrieved, the non-requested attributes will also. have to be brought into 
primary memory. If only a single tuple is needed to answer such a query/then it really does 
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not matter that other attributes that are not requeu e d happen to reside In the same tuple with 
the requested attributes; m any event, only one page needs to foe retrieved from secondary 
storage. On the other hand, tnuatty mere than one tuple mast foe retrieved in order to 
answer a query. TMs means that more thwn one page must fovteBflmd from secondary 
storage. The expected number of pages that most foe retrieved far queries that require more 
than one tuple, is inversely related to the wmiber iflf ttyhs th^'fft onto one page. (The larger 
the number of tuples that can lit onto a the lower pages that need to be accessed, since 

there is then a higher profo&aftty mattwo of tt^ wauetled tupte wW flrfl on the same page.) 
The effect of the non -requested attributes is to lengthen the tuple over what it must 
minimally be, thus reducing <he number Of tuples fir page, and consequently causing excess 
page accesses when answering queries that fly a few of the attributes while accessing 

more than one tuple. Therefore tf a Hte is pa r t itio ned such that attributes that are 
consistently requested together by "queries are placed together htto Hie same subfile and 
separated from those attributes with whkfti they are not requested, then die number of page 
accesses required to retrieve these attributes wilt be reduced over the number required from a 
file that is not so partitioned. 

On the other nabi, mdiscrtmmatety separating all attributes and storing each in a 
separate subfik will also result ra exosss page accesses. This is because a query that requests 
the attributes of a subftfe together with some other attributes that are not in the subfile will 
incur more page accesses than when all these attributes are in the same subfile, since now the 
two groups of attributes reside in different subfiles and on separate pages. When a file has 
been partitioned into subfiles, queries requesting attributes stored together in one subfile will 
become less costly to answer while these quartos that have their requested attributes 
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distributed over more than one subfile wiU become costlier to answer. Intuitively, the optimal 
partition is the partition that maximizes the cost reduction for the first kind of queries while 
minimizing the cost increase for the second Mud of queries. : ' 

Attribute partitioning Is. most useful for large dat a ba ses where queries made to the 
database usually request only a few attributes of each tuple, it is conceivable that the request 
distribution that has been observed for tuples requested J^queito alio apphes to attributes 
requested by queries. It has been observed in many pratjfiical database appHcations that not 
all the tuples of a database are requested with the same frequency. The "80^0" rule of 
thumb for the distribution of tuple request frequencies ( Mailing 07]) states that approximately 
80 percent of queries request the 20 percent most frequently requested tuples in a frie. 
Furthermore, the rule also applies to the 20 percent most frequently requested tuples in the 
file: i-e. that 64 percent of the queries request 4 percent of the most frequently requested 
tuples, and so forth. If this is also true for the request frequencies of attributes by queries, 
then most requests are only for a few active attributes of the file. 

An example of a large database, where attribute partitioning may be useful, is the 
Navy Command and Control Data Base {91 This database eonsJstt of a few relations with 
many tuples per relation* Some of these relations have as many as 85 attributes and a tuple 
length of 64 words. Queries are on-line, and predominantly Involve only a few attributes. 
Some attributes of the file like the name of ships or the class of ships ere frequently requested 
by queries while other attributes like the diameter of the torpedo tube are seldom requested. 
Therefore, partitioning the attributes of the files may result In considerable savings in page 
access requirements. 
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4. Thesis Objective 

To summarise, the principal goal of this report ft to develop technique j for attribute 
partitioning in a seifradaptive relational database eavfr o nm a nt ffpr jH^» purpose, we have 
■ assumed a database management system that supports par l Woned f^et and we have built an 
attribute partitioning system consisting of a model for «he assumed database management 
system and a set of attribute partitioning beurfcttcs. The aUilbu l e partitioning heuristics 
select a partition for a database managed by a databas e manag e m en t system similar to the 
one we have modelled. Although our model is not one of any existing system, it is 
representative of practical systems Our thw« w 'Ibanml^^^i-iwI^S^has' been' to avoid 
many of the simplifying assumptions made ill previous studies, and thereby emohastte 
important aspects of realhttc database environm e n t *. We strew the need ft* monitoring the 
database management system and acquiring pa rameters Oil die database usage pattern and on 
the evolving characteristics of the database itself. We descrti>e a methodology for processing 
transactions made to a partitioned database butit with vafiow access structures, and develop 
a complete and accurate model of the cost of accessing $k%60& : 4til& jBfcihri^ such a 
transaction. Finally, we concern ©ursetf with heuristic techrrioues that utitize the acquired 
parameters and produce optimal or near optimal attribute peititiOns for the database at a 
reasonable computational cost 
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5. Thesly OrgstnUav^on, 

The rest of this report U orginiwd *i toHow. Chtfter 2 iwrwrariiei a number of 
previous studies in the area of attribute partitioning, and hi the context of evaluating them, 
argues for the need of a heuristic solution to realistic database attribute partitioning 
problems. In Chapter 3 we provide the model of the underlying database management 
system that we have considered: the physical storage stnitture, the access structures, the 
transaction model, the method of processing queries in the partitioned environment and 
techniques for the acquisition of the parameters needed for our cost analysis. In Chapter 4 
we present the cost analysis for various bask operations on* partitioned database and 
describe how to compute the daUoaseV performance cost which 4* what we wish to minimize. 
Then in Chapter 5 we present a number of attribute partiti oni n g heuristics that we Have 
devised, along with the motivation for their consideration. We also discuss the comparative 
advantages and disadvantages of each heuristic, and outline how each heuristic has 
performed in a series of experiments. Chapter 6 poses an attribute partitioning problem for a 
relation with 8 attributes, and describes its solution using the heuristics of the preceding 
chapter. Finally, Chapter 7 concludes the report with, suggestions on how to extend the 
underlying environment in order to solve more realistic attribute partitioning problems, and 
also discusses the relationship between database attribute partitioning and other physical 
database design issues. 
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CHAPTBR ft 
THE APPROACH TO DATABANK ATTRIBUTE PARTITIONING 



The purpose of this chapter is to httroduce the approach we have taken to solving 
the attribute partitioning p#e*fem, and to contrast H with the approach taken by others in 
determining the optimal attrrbut* partition. There are two major approaches to attribute 
partitioning, each approach navmg tts own merits **d fimrtattom. The two approaches are: 
1- the integer programming approach, which to me approach taken by most other 
researchers, and 2- the heuristic approach. We have chosen the heuristic approach for the 
following reasons: lr More complex database er^trortfnents can be handled by the heuristk 
approach than by the integer p rogr a mm i ng approach, 2- An optimal or near optimal 
attribute partition can be found much more efficiently by the heuristic approach than by the. 
integer programming approach, and S- Although me heuristic approach (unlike the integer 
programming approach) does not guarantee that the optimal partition will eventually be 
found, the heuristics we have employed have consistently found an optimal or near optimal 
partition for the attribute partitioning probtem. 

li- Summary of Previ ous Work Done in Attribute Partitioning 

The idea of cfusterthg attrtbutes (and abo attribute partJttoning) as a means of 
improving the performance of a database management system has often appeared in the 
literature on file design and optimization. Until the paper of Kennedy [211 there had been 
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little systematic study of this, aspect of file organization. Further, the conversion of a relation 
into second and third normal forms [10] was sometimes confused ajith attribute clustering. 
Although normalizing a relation into its normal forms may result in the clustering of 
attributes, and thereby reduce page accesses, normalization is directed towards improving the 
logical data schema rather than enhancing database system performance. It is the functional 
dependencies among the attributes that govern the splitting of relation* in the procesa of 
normalization, rather than the data's physical characteristics or the database usage pattern. 
An example of work in the area of relaUonal database normaltt*t»on is that of Delobel and 
Casey [131 They are concerned with the problem of decomposing relations into a set of 
subrelations such that the information content and logical relationships of the original 
relation schema are preserved. However, they do not consider physical database criteria that 
would result in a physically optimal decomposition of the relation schema. 

Implementations of database management systems that support partitioned files have 
been few, and have been limited to simplified environments where finding an optimal or a 
suitable partition is relatively easy to manage- Moreover, in these implementations, attribute 
partitioning has been treated only as a one-shot affair, to be determined at the initial creation 
of a file. Attribute partitioning has. not been viewed as a database organization issue that 
needs to be reconsidered periodically, where the retiming should.be done by a self-adaptive 
database management system. 

There have been a number of previous studies of attribute partitioning and attribute 
clustering (Day Oil Seppllt [321 Osman [291 Yue and Wong [391 Benner [41 Alsberg [I], 
Babad [21 Stocker and Dearnley [35, 121 Kennedy [21, 201 Eisner and Severance [141 March 
and Severance [231 and Hoffer and Severance Oft. feU) However, we feel that the results of 
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these studies are not dfreedy iwMwfrfr to a ni p y ■ *» or r ts ^lt ij C database environment- 
Some of these, have b*^ fe*i^ aea^ses which l»v# »a^ in 
order to obtain analytic w hittewi; others have beea derigft* that a<* toceyppicte or unrealistic 
in maa.y w*^. . Our M**» * > • •# t» *» >fbx nw>ftf og th» si mBttf^ att M WTp t* "* that have 
been made in prevkni* *M*ii«a^ thus tadewek* BUJre e^^ of cost 
and database access. In a<ad4ue«, we sires* the imp&fome of databaT* cost analysis and the 
acquisition of accurate paeajRMfters, in, a*. en MUM W jw e n i t whs** ftflcfi* requirements a*e 
continually changing. Tb*| aspect of the tm i blMt parttttoning problem his not been 
addressed in p*evk>u* work; Below, we preset* a sw the earlier efforU in 
attribute partitioning, i d eu ttf y ing the a n am ed e«* b^n»ent ql eaih project along with the 
thrust of its research. 

Tv/o of the e*rk>* o4#e« on attribute dusteriag fct a self-adaptive database 
management system ax#.h$ %Kft£r< and P«*mley && 8J, They. discus* the Implementation ojf 
a self-rwfganJUng database mawegement system that catf ia» out agfihule clustering. (Recall, 
that in am attribute duster apt attribute mvf agist redttftdwuty in several subfiles.) Stoc*er 
and fcearntey show that is* % database ^an«g«ne«t system wb*re storage cpst is torn 
compared to the cost of accessing th*SHbXlle», it is beneflchitto duster the attributes, since thf 
increase in storage cost **tt 1» m«e th*p offset by the saving in access cost. Although they 
do no* cejtjjne.th* am^0^. 4tafttt(Pt ttdwMunt they use, they discus* a. query processor, 
which utilizes graph thewy and wtous hflwi»U« to process queries made to a file clustered 
by its attributes. They eanfihKhr thai attribute ^ustwiog in existing database management 
systems is both viable ai^des|iablec 

Kennedy {$1, ^Qgatjitgrs a mathematical model of attribute partitioning where each 



Chapter 2 - 21 -The Approach to DB Attribute Partitioning 



attribute a, is of known length, and has probability p, of being requested by a query. The 
Joint probability that attribute* a) and tj arc requested by the same query is assumed to be 
PiPj; i.e., attributes are assumed to appear in queries independently of one another. A cost 
function based upon this assumption is derived, which reflects the expected amount of data 
that must be transmitted (in terms of number of words, from secondary storage to primary 
memory) in order to answer a query. The objective here is to choose a partition such that 
this cost function is minimized. Kennedy's model is a mathematical formulation of a 
simplified attribute partitioning problem in terms of zero-one Integer programming where the 
only parameters are p, and I, , the length of attribute a, . In addition to many other 
simplification j, Kennedy's model assumes that when an attribute is requested by a query, the 
subfile containing that attribute has to be retrieved and scanned in its entirety (rather than 
retrieving just those subtuples of the subfile that are realty needed to answer the query). 
Since in this model, optimaUty can always be trivially attained when each subfile contains 
exactly one attribute, the number of subfiles M over which the attributes are to be 
distributed has to be fixed beforehand. (Otherwise the trivial partition, defined as the 
partition where each attribute is in a separate subfile of its own, wiH always prevail.) As 
Kennedy notes, there is no way short of exhaustive enumeration (which is infeasible as shown 
in Chapter 5) to find the optimal solution even for this rather simple model. To find the 
optimal solution to the partitioning problem posed by his model, he introduces two further 

■■■■■■'.■ ■ '<■■'*'' 

simplifying assumptions in order to reduce the integer programming problem to a simpler 
problem where mathematical optimization techniques can be applied. One simplification is 
the assumption that all attribute request probabilities are equal, and the other simplification 
is that all attributes are of equal lengths. 
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in the work of>Xlsnerand $ev*r^^ two subfiles: 

a primary and a secondary sobfiie . lad* liottti v ia located ^lowaeaaopajaie^ftorage device 
characterized by differing storage cost .and ret»tevai ipeed. The attrttwites are assigned to 
each of the subfHes w*ho« i^dwwlanejr. Two term* of the otyecttwceet function that is to 
be minimized are derived, where the fim cortfmictianh a jpectat cate of the (second) more 
general cost function. The fir« cost function ts the sum of storag* charges for subtuples in 
the primary subfile, phis the cost of aosessing at^the sohttiples ressdiog in the secondary 
subfile. (The. secondary suable & aoces^ attribute which 

happens to be residing jthei*i)^tasc^^^ heaolved by existing 

integer programming techniques. The cost of ^ thh> function 

by integer programrm^siaof&heoeder '(in^MQlfeiid^ hi 

the file and Q is the setof ^ifueries inade to the database. T4*eao2oad objective cost function 
is nonlinear, and meatuses the total cost* of a«es^^raojfer.sar> d ii a wage jar aubtuptes in both 
the primary and secondary subfiles. VTItejeaixteoaitJar findhuj^at opttaml solution for the 
general nonlinear objective cost ibnetion It even^ 

function. The limitations of the model adopted by £hanr and Seoetance are apparent: only 
a maximum of two subfiles aae^itowed and the seat aasociated widt procesting a query is 
taken to be the costofacoawlng thewhole<p t in mry « 

than the cost of retrieving just dwse subtupter of the seMitetfiat atefaatty needed to answer 
the query. Furthcrmorc,?ahe ceat of ^ using the Hnear objective 

cost function grows in the ^dbe oft*he:jaia»inf%a1 W and the number of 

queries (and the cost ffSBW^ej^ this cost is 

too large for practical purposes. 
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March and Severance [23] extend the model of Eisner and Severance to some extent 
by assuming that subtuples are blocked in each subfile into fixed size pages. (The page sizes 
in the primary and secondary subfiles are not necessarily the same, but the constraint is 
imposed that the sum of the primary subfile page size and the secondary subfile page size is 
constant.) The nonlinear objective cost function they derive not only depends on how the 
attributes are partitioned among the two subfiles, but also on the page sizes selected for each 
of the primary and secondary subfiles. Besides the rather peculiar paging organization 
adopted, the model of March and Severance has the additional disadvantage that it does not 
contain an accurate model of the cost of accessing subtuples that are selected in queries. 
Rather, the primary and secondary subfiles are assumed to be accessed in their entirety 
whenever any of their attributes are requested by a query (as in the model of Eisner and 
Severance). Using integer programming techniques. March and Severance obtain the optimal 
partition for their model. However, compared to the model of Eisner and Severance, the cost 
of solving the integer programming formulation grows even faster as the number of 
attributes and the number of queries made to the database grows. 

Hoffer [18] devebpes an extensive model for attribute partitioning, in which the 
objective cost function is a linear combination of storage, retrieval, update, and insert/delete 
costs. The problem is formulated in terms of a nonlinear zero-one integer programming 
problem, and is solved by a branch and bound algorithm. In applying the optimization 
algorithm to the formulation, it became obvious that problems of even modest size were 
computationally intractable. In order to use this model to obtain solutions to realistic 
problems, it became necessary to reduce the size of the feasible solution space to a point where 
optimization becomes economically feasible. To this purpose, Hoffer and Severance [19] 
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propose an attribute paitftieAmg method tnatpf^MK an taittaraf)^ cmde, but nevertheless 
reasonable, partition of the attribute*. Thh partita b then passed as a sorting point to the 
branch and bound aigorrthm of Hoffer U81 The 4r»kiai oartMion*ng method or Hotter and 
Severance uses the chisterfa«a#j»»»sa%ortthin of *§c©arfntek et >4felMb<iM!Gh'- J»- hewisMc in 
nature, to group the attributes together into bta^s. The chM»ring a^erithm takes a set or 
objects and utilizes a measure of "similarity'' lor all pairs of the objects. It then rearranges 
the set of objects such that pairs of objects with htfge similarity measure rail adjacent or 
nearly adjacent to one another. Hence clusters (or blocks) of objects can be identified such 
that every pair of objects within the cluster cairns a large measure of similarity, and every 
pair of objects across duster boundaries carries a small measure of similarity. Hotter and 
Severance provide attributes as .object* to the c hu t erlng atg orithm . They also develop a 
similarity measure fc* any pah* c4 attributes (caM similarity 

measure), which expresses the degree to which the pair of att ribute! is used together in 
queries. The similarity measure of a pair of a^ouha Is obiaiiied as follows: A subfile 
consisting of only the two attributes is assumed. When answering a query that requests one 
or both of the attributes, the subtopics of the subfite need a> be reprieved. However, not all 
of the information retrieved is used for answering the query: some of the subtuples may not 
satisfy one of the attributes, and hence the infiatnuaioncontaim^ for the other attribute in 
this subtuple is of no use.- »Th» »s4mttar^r measure for thepair of attributes for this query is 
defined as the ratio of .;lhe*«mouot of Ufeful date ttansferred to the total amount of data 
transferred from aucb .-.a sub#ik.v The acoes« fhmlafity measuie i» derived by considering the 
set of queries, the frequency .?of each individual e^y,a«4u^^ satisfying 

each query. 
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The queries that Hoffer and Severance consider can contain only one attribute in 
their selection component. This assumption restricts the appncabitity of their techniques. 
Also, the only access path that tlwfy allow is sequential searching and therefore the subfile 
that contains the attribute of the selection co m pon e nt of the query Is searched in its entirety 
(however, only tuples required for projection are selectively retrieved from the other subfiles). 
As with the model of Kennedy, the criteria by which a partition is selected for the file is the 
fraction of useful data transferred from secendaiy storage to prinuu7 memory. Since with 
such a criterion optimaltty can always be attained with the trivial partition, as a result, the 
number of subfiles in the partition found by the chistermg algorHhm has to be specified in 
advance, and the optimization techniques of Hof^r and Severance only look for the optimal 
partition in a subset of the space of all possible partitions. There are also problems 
associated with the clustering, algorithm that they use. In Section W, we describe some of 
these problems. 

S* The Intogor Programming Approayfth fe ^ftffflm^ Partitioning 

The two approaches to attribute partitioning that have been taken are the integer 
programming approach and the heuristic approach. Most earlier work on attribute 
partitioning falls in the former category. There are two major problems associated with the 
formulation and solution of the attribute partitioning problem tot the integer programming 
approach: I- The applicability of this approach is limited because of the undue simplifying 
assumptions made on the problem environment in order to obtain an objective cost function 
that is amenable to optimization. In a realistic database environment where the file has 
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many attributes and fnaStf^ UWtUs are made to did database and tfcerrare many access paths 
available by which to awabs the data, the ii uwibei at varblb l ey and ceCTtramts to consider is 
so large that it effectively pttcJadet an integer pwayamiuing formulation of the attribute 
parttttc#ingp rU I *fiW «re assumed in ' 

the database environment, the attribute ptitltt e wiwf p r obl e m usually reduces to solving a 
zero-one nonlinear iniwger pi^rammmg problem to which no available mathematical 
programming technique om be applied. As E irwedy notes Cat no technique has been found 
(short of enumeration io 101** Om tirattcd uai1Wla»m| problem that is expressed only in 
terms of attribute request pratobitities and attribute itewftlw. JFor-cases where mathematical 
programming techniques- at* available for solving 4fc« Integer programming formulation, 
applying them to even modestly shed p re btem « tumputat tooatfy tafeaaible. 

The simpl ifyi n g ass umpthan on the ipm b hmi environment that have been made by 
previous studies f^l» a *re^eitt«« m two a«egoi*» One is a limitation .-on the complicity 
of the queries that ate made to the database. Query pr e dicat e s at* either assumed to consist 
of single equality Ysmtitetwnvw «be database usage Is crudely described by a set of attribute 
access probabilities that tnd*^ the probability ef an alti^bttte beit^ realleged by a query. 
Correlations between attrlbuat^i ecwr ence s in queries mm Jgaored. The other simplification 
usually adopted concerns <he jcomptrtniton «f tta^aa^af;^ 

amount of information tltat »w a w be transferred, in 4*»ts tegar4, the effect of blocking tuples 
(and subtuples) into pages »as been - ^ tw if rti i iity - j | a^a a a» d fe | : T^f^Ma^ihng of tuples into pages 
has the- effect of increa s i ng tin *iweW rt^mnM*nation* rai^^ access. However, 

this increase is not linear, *i«ee accessing any naraber of tuples that reside on the same page 
will result m only one page access. If these blocking effects are ignored, then the 
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partitioning problem has a trivial solution. Kennedy (Lemma 4.1, [20]) has shown that when 
the amount of information transferred is the tote criterion of the cost function, the optimal 
attribute partition is the trivial attribute partition, as described in Chapter 1. The reason for 
this is that the total access cost is non-increasing as the attributes are dispersed into an 
increasing number of subfiles, eyen if the attributes are inappropriately partitioned. Hence 
in studies where blocking effects are ignored, in order that the trivial partition not prevail, 
the number of subfiles into which the attributes are to be partitioned has to be artificially 
limited and prespecified, 

The approach to attribute partitioning that we have taken in our work is heuristic 
in nature. In the heuristic approach, an optimal or near optimal partition is found for the 
attributes by a process of stepwise minimization. An attribute partitioning heuristic which is 
based upon stepwise minimization starts wKh a given partition (e.g. the trivial partition), 
and attempts to derive from it a new partition that Is incrementally superior to the original 
one, in that the database partitioned according to the new partition will have a lower 
performance cost. When this is achieved/the heuristic further tries to improve upon the 
newly derived partition. Each time an improved partition is derived, the performance cost of 
the database is reduced. The stepwise minimization process is continued until no 
improvement can be made to the latest partition. This last partition will then be returned as 
the result of the attribute partitioning heuristic The resultant partition is not necessarily 
optimal, although it can often be argued that the partition is near optimal. (The near 
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optimally of the partition proposed hy the l i fti eia ui caw be verified by comparing the 
performance of the database rnanagement sys tem wh e n the file is optimally partitioned wich 
the performance when the Me » partitioned a» wnjgi ml by the heuristic.) Indeed, in the 
course of our experimentation with ear attil h ute pafKttonmg hcurtsties, we have consistently 
found that the resultant partition of the heuristics is either optimal, or differs only 
insignificantly from the optimal partition. 

The heuristic approach to attribute partitioning does not suffer from the two major 
problems associated with the integer programming approach. The model of the the database 
environment may be as complex as desired. The complexity of the model adopted does not 
seriously hamper the heurtttk's ability to find reasonable solutions to the partitioning 
problem (although it may affect the precise amount of search time required by the heuristic to 
find a reasonable solution); We note that, although pur model does not consider certain 
parameters that have been considered by some earlier studies (e.g. file storage cost, overhead 
cost for accessing subfiles, different access and transfer costs fbr each subfile, and the 
imposition of constraints on the allocation of attributes to subfiles), we could readily 
incorporate th est parameters into our model of the database m a n a gem ent system and take 
them into consideration, without needmg to sigmlkanUy aker our partitioning heuristics. 
(These extensions are descwbed in Chapter 7>) The heuristic approach is also relatively more 
efficient with respect to tho time needed to determine a solution. For example, the main 
attribute partitioning heuristfci that we develop m Chapter 5 operates m time that is on the 
order of the product of the number of queries in the ueage pattern and the square of the 
number of attributes in the £Ma TWs compaw very favorably with execution time of the 
integer programming approach. 
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The model of the database management system we have adopted in this work is in 
many ways a generalisation of earlier work, and although not a model of any particular 

. -. _ . > :,. ■ ■'■■ .- -■ - - -ii-E;! :■■..*;.■:....,--.■■■.■■■,■ ■ ■■ ■ -■ 

existing system, is more reflective of practical systems than earlier models. We have allowed 
more complicated forms of queries, and have also con sid ered the effect of blocking subtuples 
into pages. We allow a diverse set of access structures in our model including links, indices, 
and segments. The objective cost function that we seek to minimize is the total cost of 
answering the queries posed to the partitioned database, and is expressed in terms of the 
number of page accesses, rather than in terms of the amount of data transferred. Unlike the 
models of previous studies, two tuple* wm^ htfpen to reWt on the same page, if retrieved, 
will not incur twice the access cost of retrievmf One of them. Co n V e r s e ly, a smgte tuple that 
has its atMbutes partitioned, if retrieved m its entirety, WW cms* numerous page access*. 
Consequently, if the attributes of a Hie art uai t tUor w d hta p pfopHetery such" that attributes 
that are requested together are placed ft l«|»a1i«e r iuWWaW : tHe patfotwuitnce cost of the 
partitioned database increases. This contrasts with previous inodtb where access cost was 
determined solely in terms of total information transferred (and so for which the trivial 
partition is always optimal). In our model, we do not need to specify M, the number of 
subfiles in the chosen partition. Rather, M is un constrain ed and is determined by the 
heuristics according to the optimal partition. 

4. glook Pfogra>m of tfrf 4*trilh**f| FfWyfltlofttog Bygtcifi 

Our attribute partitioning system consists of four components. Figure I shows the 
block diagram of the system, in which each component appears as a box. The four 
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components are t- the para m eter ai q u fa Mor and tereeastor (descr i b e d m Chapter 3), 2- the 
file cost estimator (describe* *n Cfupter 4), S- the query prot i aior (described in Chapter 3), 
and 4- the attribute pa rtit io ning heuristics (described toi Chapters ft and 6). The circles in 
the figure represent the iuWwWuii of f o re tasted parameters, p separ e d by the parameter 
scquisitor and forecaster, c h a r a cter i si ng the database and to mage. Edges in the figure are 
labelled by the kind of object passed from one component to another. A brief description of 
each component follows. 

1- The parameter acquhttor and forecaster eogmimmtijf moaner* the usage pattern ' 
and the response of the database mana g w e nt syanan to the querhtt in the uaage pattern. It 
collects the statistic* needed as parameters hf the flk a* esttm^av aiid ^ <|uery processor. 
At certain points., in . time when file repositioning is initiated, the parameter acquiror 
calculates trends and makes forecasts of the darabaie utage p»mm iHtd database parameter* 
for a time mter vat tato the future. , 

2- The file cost estimator receives a proposed partition from the partitioning 
heuristics and evaluates it by finding the cost of processing each onerv in the forecasted 
usage pattern against the acc o r di ng l y partitioned file. To compute the cost of processing a 
query, the file cost estimator passes the query to die faery processor for query analysis. The 
query processor finds a method for the query and returns the method to the file cost 
estimator. A method for a query is a procedure indicating how to go about accessing the 
subfiles in order to answer that query. U»if»g the forecasted database parameters for the 
future time interval, the file cost estimator computes the number of page accesses required to 
answer the query against the partitioned file according to the query's method. Summing 
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Figure 1 Block diagram of the attribute partitioning system. 
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these costs for ail the queries in the usage pattern, the flit east estimate* obtains an estimate 
for the performance cost of the proposed parttttan which the partition would be expected to 
incur in the future time interval 

3- The query processor evaluates a query in a partitioned environment by finding a 
method for the query. It requires the forecasted parameters of the database and the file 
partition. The query processor is heuristic and the method found ll normally near optimal. 

4- The attribute partitioning heuristic* propose a suitable partition of a file's 
attributes. The proposed partition is passed Sw cott l^matleft^ the file cost estimator. 
After the cost of the proposed part it ton is estimated, Jhe 4miritaa attempt ■to come up with a 
partition that is incremental superior 4» the last pjpposed partition. This process is 
continued until a partition is found such that no other partition proposed has a better 
performance compared to it If the performance cost of the final partition is less than the 
cost of the current file partition by a margin that exceeds file repertitiening cost, the file is 
repartitioned according to the resulting partition. 
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CHAPTER 3 
THE MODEL OF TOT DATABASE MANAGEMENT SYSTEM 



In this chapter we will describe the underlying model of the database management 
system that we have assumed in our work. We will describe the storage structure and the 
access structures we have adopted for the physical representation of a relation, and the 
assumptions we have made for the purpose of reducing" the problem of attribute partitioning 
to a manageable size We wW then dfecribl the structure of the queries made to the database 
and the strategy employed to process the queries in a partitioned environment, finally, we 
will list the parameters required by the compon e n ts of our attribute partitioning system (the 
attribute partitioning heuristics, the file cost estimator, arid the query evaluator), and describe 
how these parameters are obtained from monitoring the operations of the underlying 
database management system. 

1. The File Model 

We have chosen the relational data model as the logical view of data for a database. 
A database in the relational context consists of :©ne or more refauiom. However, in order to 
make the problem of attribute partitioning imnageabte in size, w« address the reduced 
problem environment of a database with a single relation. In addition we assume that the 
physical implementation of a relation is a flat file. That U, a relation U stored as a set of 
unordered contiguous tuples in secondary memory. There arr no hierarchies of domains, nor 
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pointers from one tuple to another. Although the assumption of flat file storage structure 
may seem rather severe, we note that this is the mostftatafai way of Jtoring a relation. Also. 
some of the drawbacks of the ftet file storage structute, such as the placement of frequently 
used data together with seldom used data m the same physicsd kxaJity, u precisely wh*t 
attribute partiticming intfmdj to eiknirwte. We note here that although the work reported 
here is based on the assumption of a single retauon database and a flat file storage structure, 
the approach to attribute partitioning thai we have taken and^fht,, attribute partitioning 
heuristics that we have developed should be extendible to problems where„any of the two 
assumptions are relaxed. Specifically, if there is^* l§|ilt^ cost of 

answering a query made to a mu^i-ielati©ne^d*te§»^ structure, 

then the mam heuristic techniques that we have developed roef.be regarded as a viable 
candidate for the purpose of attribute partitioning. For further discussion of the possible 
relaxation of the above two assumptions, reier to Section 7^. 

All the subfiles of the attribute partition are assumed to reside on direct access 
secondary storage devices like disks £©1 Storage space on such device* is divided into fixed 
size blocks called pages. A page is the tnf«TT»tion quaiitum Jtraosferrtd between the disk 
and primary memory m one disk access. The accessing cost of a page is assumed to be 
proportional to the average disk seek and laiertcy tlma plus the page transfer time. Hence, 
accessing cost wilt be mdep s ndent of the sequence of page accesses. Consequently, we may 
think of the pages of a file scattered throughout the disk, with no restriction on their relative 
physical locations. .■■■■,' 

As mentioned above.; the tuples of a relation are stored unordered with respect to any 
attribute. The order m which the tuples J#re stored w# be tftefr chrohotogical order of 
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insertion into the file. This makes the problem of file maintenance due to updates, insertions, 
and deletions much simpler. If a tuple is updated, the new values replace the old values in 
the same tuple. A tuple that is deleted is joined to a pool of deleted tuples and will be reused 
for newly inserted tuples. (Such a pool can easily be maintained by threading the tuples that 
were deleted into a list.) A. new tuple that is inserted in the file replaces a tuple that has been 
previously deleted. If the pool of deleted tuples is empty, then the inserted tuple is appended 
to the end of the file (if the file occupies an integral number of pages, a new page is allocated 
to the file). 

The above strategy for overflow handling is intended to maximize the number of 
undeleted tuples per page, and keep the file size to a minimum. The cost of a sequential 
search and tuple retrieval by the link access path (described below) are inversely related to 
the (average) blocking factor (the average number of used tuples per page), and these costs 
should be minimized by keeping storage utilization in the tuple space as high as possible. 
Even with the above assumptions, poor storage utilization may still ensue if the database 
usage pattern consists of a large number of Insertions followed by aft equally large number of 
deletions. To correct this, garbage collection may be pe r fo rmed on the tuple space so the 
tuples are recompacted to occupy as little space as possible. We note here that in 
partitioning the attributes of a file, the cost of garbage collection may be eliminated from 
consideration and that if we ignore the effect garbage collection has on the subfile blocking 
factor, the optimal attribute partition is independent of the garbage collection cost. The 
reason for this is that no matter how the file is partitioned, garbage collection of deleted 
tuples requires that the entire file be brought into primary memory and shipped out back 
onto secondary storage. Since the total amount of storage is fixed regardless of how the file is 
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partUioned (excepr for page breakage at the end of each subfile, which is negligible), the cost 
of garbage collection does not enter the optimization process. On the other hand, the 
blocking factors of the subfiles do influence the optimal partition. The more frequently that 
the file is garbage collected, the fewer the number of unused tuples per page and the larger 
the blocking factor would be on the average for the file, Therefore, the optimal partition for 
the file will partially depend on the frequency of garbage collection. Since the optimal 
selection of points where the file is to be garbage collected is in itself another database design 
optimization problem, we wiH not consider the problem of the optimal determination of 
garbage collection points. (See the works of S hne t d er ma n tS5j and Yao et al [38] for a 
discussion of this problem.) We witt assume that the subfile blocking factor that the 
parameter acqirisitor prepares for the attribute partitioning heuristics is the overall average of 
the observed blocking factors throughout the planning horizon. 

We will assume that tuples are of fixed length (i.e each tuple occupies the same 
amount of storage space), so that each page has a capacity for a fixed number of tuples. This 
implies that attribute values have fixed sizes, since a normalized relation has a fixed number 
of attributes per tuple. This assumption is in correspondence with the relation being a flat 
file and is a necessity for the realization of Mnks between subtupks of the same tuple. We 
also make the assumption that each page contains an integral number of tuples, and that 
tuples do not overlap page boundaries. 

In our file model, we allow three kinds of access structures. These are: segments, 
links, and indices. An access structure is a mechanism that makes the search and retrieval of 
tuples possible. In other words, given the value of an attribute, an access structure can locate 
and retrieve all tuples having that value for the attribute. Th* access path of an access 
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structure is the way in which the structure is used in such a search. A segment is a file or a 
subfile that may be retrieved into primary memory and sequentially searched from top to 
bottom for tuples with a certain attribute value. Hence by usteg tte sequential search access 
path of the segment access structure, we can both teeate desired tuples and retrieve these 
desired tuples at the same time. A Imk u sir accea structure for retrieving tuples which have 
already been located. In other words, assume that we have a pointer or some other identifier 
that uniquely identifies a tuple by its location. Linking is the access path for deriving the 
physical address of the tuple from the Wenttfler and retrieving the tuple from secondary 
storage. Therefore the link access structure is a m e chanwm for retrieving tuple* that have 
already been identified and whose location is known. A Nnk cannot be used to search for 
tuples that possess a certain value (content retrievingX h» our file model,, each sobttiple of a 
subfile has a link to all its corresponding subtuples tot aH the other subfiles. The 
corresponding subtuples (or co-subtuples> of a subtuple are all the subtuples that made up a 
single tuple before the file (and the tuple) was partitioned. An index is an access structure for 
locating subtuples with attributes that match certain vatues. Without actually retrieving the 
subtuples. An index does, not have the capability of retrieving tuples In order to retrieve 
the tuples that have been located by an index, a link access structure is used. In our file 
model, any attribute of (he relation may have an index; which ones are actually to be 
indexed is a. separate database design issue. 
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3. Linking ftufrfoyUo ^ - 

Sequentialry searching a ftte ^ » »<M^^ to » rtfaightfoww i r d mitw> »wl we wilt not 
discuss it any further. A te^ «» ttw »tb« ha^, »*>» iwaoiw ii ra i l| i f < tlwtt i» widely used in 
our model atKi we will doscrtbe how Mnki^ i» partorimd In dvteit/ > - ! ^ 

Qnee * subtuple ha* jbeen tocrttd or re»i<vd, it wmt»»|>os»*bte to retrieve any of 
its co-subtuptes m the otr»er aobftle*. (A co-s obtupte way h»»<» beWWeved in order to we 
if an attribute of it contains • cwtain v*iu«, or in trdv to p»oJ*Bt o»e of ks ittributei.) 
Hence, wo assume that each mbtopfr has » link to »M *t> O^ooftlpte atnd that the 
co-mbtupkfs may be retrieved by linking. Lirtiit ^ » fn«m^ of r t totin g toptes may be 
classified according to purpose^ re*» M< h M, and eortiOIti ly 7^ p ur p ose of a IWfc in our 
model is to rekt* co-subtupks. te reawzatson to logical, JA the link Js derived by 
transforming trtc address of one co^bOJpteiato^rrt O 

of the Imk is one-to-one, t.e. ea«h subtopte to linked to **»«tyo^ subfile. 

Thus, there is no explkk pointer from a sabtwpie of * mbflh to eoth of its co-jubtuptes. 
Rather, the address of the co-subtuples in different subfiles can be calculated from one 
another's addr«*e*, fey subtuple or ee-*ubtup*e address we **•* the took? identifier far the 
logical address of thetupte), which is the addfmof th*OJphr reftsltvt to the base address of 
its file. When retrieving a subtuple by using a link, the lublupfcV Hapte identifier (TW) is 
translated by the file's page map table into the physical address of the page the subtuple 
resides on and the offset of the subtuple within the page. The page address is then used to 
retrieve the subtuple from secondary storage fit one page access. Note that when retrieving 
any number of subtuples that reside on the same page, the page needs to be retrieved only 
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once. Once we have the TID of a uihtuple in a subfile (this happens when the subtuple 
has been located or retrieved), we may obtain the TIDs of all co-subtuples Jn other subfiles by 
applying a transformation to the subtupfeYTlD. This has been made possible because when 
partitioning a flat Ok implementation of a relation, all coreubtuptes retain their relative 
position in their subfiles, and also because within a subfile aHsubtuptes are of the same size. 
To see what this transformation Is, let t, be the tuple with tuple number i (i.e. r, is the ith 
tuple of the file). If the file is partitioned into W suWilei, then r„, r^, m ... ,r m are the M 
co-subtuples. I-et t u be the TID of subtuple * „ , J - 1, „.,M . We want to calculate t, k , 
the TID of r, k , from ty the TID of fg i We first show hew to get the tuple number i 
from the TID ty. Let S be the system page site, lj the subtuple length in subfile F, , and 
let b) - LS/I, J be the nuntber of subtuptes per page in F, (we have assumed that tuples or 
subtuples do not cross page boundaries); then H0 J is the page number of *"„ , and 
<t 4 modS) It the offset of r H in its page. The tuple number I is therefore 

(3.2. 1 ) I - b, ttyS J Ht H mod S)/l, 

Finally, we want to calculate t ik , the TID of r ik , from i, its tuple number. Since L»A>a J 
is the page number for the subtuple r ifc in F k , and (i mod b|,) is the number of the 
subtuple within the page, we have: 

(3.2.2) Uh-SLi/baJ + lkdmodbfc) 
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3. The) Index Qyymniat^tion -^v- 

Another access structure we have considered J» our model U an index A file may 
have an index on one or more of its attrtta^et. A naeeth that oantahw an Indexed attribute, 
has indexing as art access pa& to locate subcupiet haetag fc specifi e d* value tor that attribute. 
In our model, we ha ve chosen an index to be * baamced tree, iifttre each node of the tree is 
a page. A detailed description -4$ ihe index's sowtwe may be totmd in Chan Pflend 
Blasgen and Eswaram {61 Ifbe index is very shnHar to tb« B-tree* of Bayer and McCretght 
[3] both in terms of structure and the way it is imhrtalnsd. Briefly, each norHeaf page of the 
index contains an ofderod|;iet of pairs efikey*#aMaea«talues> aiai poMters^ eatt* 3 pointer 
pointing to nodes in tlw «eM lou^ teeel^^ 

of the node the pair points to. A teaf page constets of a ksy l eafr wud by an ordered Kst of 
TIDs of subtuples that ha*e the key as the value of the indexed attrlbott in rtw subftte. The 
choice of index structure for our we^ is obviously not limited to balanced trees. Any Index 
that lends itself to usage cost analysis and which is independent of the choice of file partition 
may be alternatively used. 

To concentrate on the problem of attribute partitioning, we assume that the choice of 
indices and their structure is predetermined and chosen, beforehand on the basis of other 
criteria besides the file partition. This is not to suggest that the pr ob l ems of index selection 
and attribute partitioning are independent of one another. Indeed, the two problems are 
mutually interdependent and a better solution to the attribute partitioning problem can be 
achieved by their, simultaneous solution. The probl e m of selecting indices that befit a 
database usage pattern has been extensively analyzed by Chan 171 Our work on attribute 
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partitioning takes up another dimension of the general problem of physical database design. 

We will consider four types of transactions that may be conducted against the 
database: queries, updates, insertions, and deletions. The query and the update transactions 
consist of two c o m po nents: a selection c ompo n en t that d e term ines the tuples that are to be 
selected, and a projection component that determines which attributes of the selected tuples 
are to be extracted and returned (In the case of a query) or updated (fin the case of an 
update). The deletion transaction consists only of a selection comp on ent that determines 
which tuples have to be deleted. The Insertion transaction has no components. An insertion 
transaction Is basically a set of tuples that have to be inserted in the file Because of the 
similarities among query, update, and deletion transactions, henceforth, we will discuss only 
one of them, namely queries, in full detail. The reader should assume that the discussion for 
queries can be generalized for the other transaction types as weft. The only difference among 
the transaction types is how the projection c om po ne nt of each transaction type is processed 
after the tuples are selected. This difference in processesing the projection component will be 
delineated later. 

We have made certain simplifying assumptions on the structure of the queries 
considered in our model. The simplifications were necesshfted by the need to reduce the task 
of query cost analysis to a manageable site We have disallowed Join operations on the 
relation in queries. The boolean expression in the selection component of a query consists of 
either a conjunction made of equality conditions, or a disjunction made of equality conditions. 
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A query with just Orie equality condition ii considered to be * special case of a conjunctive 
query. An equality condition is a predicate ttf the form (a • x> , where a ts an attribute 
name, and the attribute Value x of the equality condition ii a constant or program variable 
which is known at the tkne the query is processed, "fht equality condition lit the selection 
component is used to search for all the tuples <subtuptes) in the file (subfile) that have 
attribute value x for attribute a. The projectkm component i* a set of attributes whose 
values are extracted from all tuples that satisfy the selection component and returned as the 
answer to the query. In a conjunctive query, an attribute cannot appear twice in the selection 
component, or appear both Hi the selection and projection components. Although we have 
restricted the set of allowable queries by the assumptions presented above, we have stilt 
included a large number of possible queries, encompassing many of the more frequent queries 
encountered in practical database applications. 

When a query is made to a database, the query processor does the necessary search 
and retrievals on the database and returns the answer to the query. There U a cost associated 
with processing a query. In our attribute partitioning system, we have incorporated a query 
evaluator and a file cost estimator that can analyze a given query and provide an estimate of 
the cost of answering the query. Query cost analysis is a complex task. The assumptions we 
have made on the structure of the query alleviate seme of the difficulties in query processing 
and query cost analysis. BesMej the assumption on the structure of a query, query cost 
analysis depends on t he assumptions made on the distribution of attribute values in the file. 
Query cost analysis also dependi -m the di stiibul lo n of attribute occurences in the selection 
and projection components of queries and the distribution of attribute values in the equality 
condition predicates of queries. As we have mentioned In Chapter 2, previous work done on 
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attribute partitioning made simplifying assumptions on the distribution of attribute values 
and on the distribution of attribute requests in order to keep the problem of query cost 
analysis (and hence the attribute partitioning problem) within manageable limits. We have 
also made simplifying assumptions on the distribution of attributes values and attribute, 
requests in building our model of the database management system. However, our 
simplifying assumptions are less restrictive in nature than those made in the works of our 
predecessors and are closer to the reaHUes of practical database usage. We have made the 
following two assumptions m our transaction model. 

I- We assume the fraction of tuples that satify a one predicate selection is the 
selectivity of the attribute in the equality condition. The (average) selectivity of an attribute 
of a relation is the average fraction of tuples under consideration that have historically 
satisfied an equality condition involving that attribute. In other words, the selectivity of an 
attribute is the fraction of tuples that will most probably satisfy an equality condition on the 
file. The concept of an attribute selectivity measure is an important tool for database 
modelling and query cost analysis. The attribute selectivity measure will be defined and 
described fully in the section on Parameter Acquisition. From the attribute selectivities, the 
number of tuples that satisfy an equality condition On an attribute is estimated as the product 
of the selectivity and the number of tuples in the file. Using this measure of selectivity 
avoids the halve assumption that the attribute values are uniformly distributed in number, 
and that the number of tuples satisfying an equality condition is the fetal number of tuples 
divided by the number of different values of th* attribute. Also by using this measure, we 
have avoided the simplistic assumption that attribute values of a given attribute occur with 
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equal probability in the selecti on component of queries. Although «* could have obtained a 
still better model of value distribution by noting the number of tuples that contain each value 
of an attribute in a table, the attribute selectivity measure has the definite advantage that it 
takes little storage for its preservation. The other scheme requires that a table of attribute 
value frequencies be maintained for each attribute in the file, and if there are many distinct 
values for an attribute, this table wHI consume a significant amount of storage and will also 
be very difficult to update. 

2- Since we allow the specification of queries with multiple equality condition 
predicates, it is necessary to have a measure for the joint resolving power of two or more 
equality conditions. <This measure is called the joint selectivity measure) For this purpose. 
we will assume that the appearance in tuples of values belonging to different attributes is 
independent. (I.e., the probability that value x of attribute a and value y of attribute b. 
appear in the same tuple is equal to the product of their individual probabilities of 
appearance.) Hence the fraction of tuples satisfying a conjunction of predicates 
simultaneously is the product of the fractions that satisfy each predicate, and the fraction of 
tuples satisfying a disjunction of predicates is the comp le m e nt of the fraction not satisfying 
any of the predicates of the disjunction. 

One assumption we do not make in our model, however, is that attributes occur 
independently of one another w the selection and projection components of the query. 
Neither do we make the less naroow but neverthetew stifl restrectjve assumption that the 
correlation between attribute occurence In queries is determined by joint probabilities of 
attribute occurence. We actually keep a record (in a table, called the table of query types) of 
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all queries made to the database, and the exact correlation in the occurence of attributes in 
queries may be obtained form this table. Thus we avoid making the strong (and often 
inaccurate) assumption that an attribute is requested by a query Independent of what other 
attributes are requested by that query. This table of queries is concise in that queries 
involving the same attributes but different attribute values are clustered together in one 
entry. (The number of queries in the cluster Is alio i^cordecT in the entry.) 

5. Query Proopygi^g ifi a Pft^^ft^ .P^ft^ ge 

An integral part of a database management system is a facility to decide how to 
answer queries. Since we are modelling a database management system that decides how to 
answer queries posed to the database, and since in the course of attribute partitioning we 
need to estimate the cost incurred in answering a query posed to the model database, our 
attribute partitioning system will also need to decide how to answer queries. When a query is 
made to the database, appropriate access paths must be chosen so that tuples satisfying the 
selection component of the query may be located. After the satisfying tuples are located (i.e. a 
TID list of such tuples is obtained), the same access path (or possibly some other access path) 
will have to be used in order to retrieve the tuples. 

For example, assume we have a conjunctive query Involving attributes a ( , , a L in 

the selection component and attributes a^i , , a* in the projection component made to a 

partitioned file. In order to answer the query, the subtuples that satisfy the equality 
conditions on a | , ....... a L need to be located ft»y creating a TID list pointing to the subtuples), 

and then their co-subtuples containing attributes a^i , — , a* have to be retrieved so that the 
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yakie of projection attribute* *t,| -, ...^m^ m^^^mmm^im^w^mwA. ■ Assume that 
there are indices avaitebtem sc*ne or the attributes rn^^^ti^ Wm^^^proaxd to locate 
the subtuples that satisfy the selection cunitionei it h» eINter of the tafo toftowing ways (in the 
rest of this section, we d&fto» e^phottly s p e tity whenth* Uaws(oiw wiioBt #&i»9.g.2 a*e to he 
performed on TlDs of subtuples to get ^ie Ti&i ^ co- si«btopl e s; we assume the 
transformations are pe rform e d whenever necessary): 

1- Use all the applicable indices to retrieve th^T^ 

indexed attributes, intersect these TIP ma (because the query h c o njuncti ve), and from the 
resulting TID list link to the subfiles that contain any of the unJn4eat ed attributes a, , ...„, a L 
(an applicable index is an index on a selection am Itarte). Subflhi are accessed one at a time. 
Everytimr a subfile is accessed, its subtoptes with TID^ 

and checked to see if they satisfy the equality condition* >mt the unin d ax ed attributes. The 
TIDs of subtuples that do not satisfy any of the unlndexed sttrthutat as* then pruned from 
the TID list (i.e., the TID Hst of the subtaphs that satisfy ah of the unmdexed selection 
attributes in the subfile is intersected with the oW TID list). After all the subfiles containing 
selection attributes have been accessed, and the TIDs in the hst have been tested to satisfy the 
equality conditions, then ah subtopics with TfDs in the Ihf sot l eWev e d (again by linking) 
from all the subfiles containing projection a ttributes, and the projection attributes are 
extracted. 

2- Use none of the indices. Sequentially search one of the subfiles containing 
selection attributes, and create a TID list of the subtuples that satisfy all predicates involving 
the subfile. Thereafter, using the TID list, link one by one to the subfiles containing the 
remaining selection attributes, until a TID list of subtuples satisfying the entire selection 
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component of the query U obtained. Finally, Hnk to uibfltet conUlning projection attributes. 

Each of the above two schemes may be thought of as a step by step procedure where 
at each step an access path (sequential searching, indexing, Unking) is performed in order to 
obtain the TIDs of subtuples that satisfy one or more of the equality conditions in the 
selection component. Let us call the act of obtaining TIDs of subtuples that satisfy the 
equality condition on an attribute the act of resolving the attribute. Hence each of the above 
two schemes is a step by step procedure where at each step an index is used to resolve ah 
attribute, or a sequential search/link is used to resolve one or more attributes in one subfile. 
We define the method of a query to be such a step by Hep procedure where at each step an 
access path is used in order to resolve one or more attributes. • 

A query usually has many different method*, for example, in the two schemes 
above, we chose either to use all the indices, or to use none. We might have chosen to 
resolve some of the indexed attributes (in the selection component of* the query) by indexing, 
while resolving the rest of the indexed and unindexed selection attributes by linking. 
Similarly, when linking to subfiles, the subfiles wtn be accessed in some sequential order (l,e. 
one subfile is linked first, another subfile second, etc) Each distinct subset of applicable 
indices and each distinct subfile sequence constitute a method of the query. (Hence each of 
the two schemes above may be translated into many methods as the sequence of linking to 
subfiles is instantiated.) 

There is a cost associated with a query's method Depending on what indices are 
used and in what sequential order the subfiles are fmked/ the cost of answering the query will 
be different. For exampkf assume that hi resolving the attributes of a query, two subfiles 
have to be linked and when each subfile is linked, the Site of tne TlD list will be reduced by 
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an equal factor. Then It 4$ tetter to H»k te the subrhe *i!§i «be s m a i i r number of pages 
before we tink to the subfile with the larger nundier *f pages. Although to the first method 
the second hnk wiH result to more pay accesses than d* s e c o nd hnk in the first method/the 
fust link hi the second mem e d wJHrewm m even nwj frpage st aai sai Hum the first link in the 
first method. Therefore, it is important thai a qtsery piapjsser costsider e^ ^ methods of a 
duery and select the method which results in <he i melh n t WMm ha r of page accesses when 
answering the query. The optimal method of a query art« depend on »*he attributes in the 
selection component of the query, the attribute* in the projsctien component, the attribute 
setecttvlrtes and lengths, the attribute partition, and on ether ■■(■hew parameters. A query 
processor will have to consider ati these pa ram e^ rs when t^*ee<*nf a method Cor a query. 

The purpose of this section H to present how our attrfam partitioning system goes 
about choosing a method for the query made to the parythmed database . Before we- present 
our strategy .for choosing methods , we deiktea* and describe the different phases of query 
processing; the first phase h Ibe phase in which Ibe query processor decides on the optimal 
method of the query 

Processing a query made against a partitioned database with ■*, single relation and in 
which no Joins or aggregate operators aw involved consists it three phases: I- query 
evaluation, 2- query resolution, and S- query aoswertog. 

1- Query evaluation - Query evalu^kin » ti«pre<eMctfiJi»ding the optima! method 
for a query. In an environment where the ft* is partitioned *nd a#^^ 
means: I- selecting the indices to use hi answering the query, which couid be selecting all, 
none, or some of the applicable indices, 2- selecting the sequence of accessing those subfiles 
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that contain selection attribute* not resolved by means of Indices. (Note that if no index is 
utilized by the method, then the first subfile of the method will be sequentially searched, 
while the remaining subfiles will be linked.) The agent for finding the optimal method for a 
query (or finding a suitable method in case the optimal method is difficult to find) Is the 

query evaluator. The query evaluator chooses a method with the objective of minimixlng 

. „,- ■ . . i ' • ' *-■; - ■■■■•■ 

page accesses when answering the query. Later in this section, we will present the strategy 
used by the query evaluator of our attribute partitioning system. The method chosen by our 
query evaluator is not necessarily the optimal method for the query, although we will show' 
that our strategy results in near-optimal methods. When a satisfactory method ii found for a 
query, we say that the query is evaluated. 

Note that in our model of query evaluation, the query evaluator does not take into 

consideration the projection attributes of the query- Strictly speaking, the query evaluator 

' • . . .". ..■;..- .■• -,■• -.-"-.' W-; f-iS--^' '<*. \'">\ x { '■■■■■ ■ . ■ = '■ ■• ! '-' 

should also take the projection attributes into consideration and the method of the query 
should specify the sequence of linking to the subfiles containing projection attributes. This is 
because the cost of answering a query is influenced by the sequence in which projections are 
made. For example, if a subfile contains both selection attributes and projection attributes, 
then it is beneficial that this subfile be linked but in the method; since if the subfile is linked 
last, both the selection attributes may be resolved and the projection attributes may be 
projected concurrently. If this subfile is not linked last. It wiH be linked once for resolving the 
selection attributes and another time for projecting the projection attributes. (Note that each 
time the subfile will be linked from a different TID list.) We have eliminated projection 
attributes from consideration when evaluating a query. We do this because: I- considering 
projection attributes will make the problem of query evaluation still more difficult, and 2- 
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because we believe that. thei egmtuct in which the suofilea a?e a co nied in resolving attributes 
has a more profound influence en the cost <f -i tmr tug ■» query than ttie sequence in which 
the subfiles are linked for projecting attributes* 

The query, e val ua ti o n performed by ow query H^alua fca T does -not- entail any 
input/output operations (te. page accesses). The query evaluator does not need to know 
about the actual data contents of the subfiles; it only requires the various parameters 
prepared by the parameter acquisitor. The query evaluator evaluates a query by choosing a 
method for it, utilizing some strategy. One such strategy is exhaustively enumerating all 
possible methods for the query, estimating the cost of answering the query according to each 
method (by using the file cost estimator), and then choosing the optimal method. We have 
discarded this strategy because it is computatiottafty intractable to consider all possible 
methods for a query. This is especially true when making a query against a database 
partitioned into many subfiles and/or if there are many indices available. The strategy we 
use for query evaluation "is instead based upon choosing a near-efitmel method without 
requiring extensive analysis of the query. 

Query evaluation is the only phase of query processing that is an optimization 
process. The other two phases of query processing do not attempt to optimize the cost of 
processing a query. Query evaluation is the only phase actually performed in our attribute 
partitioning system. The next two phases are eitly performed by a database management 
system when it actually processes a query. The reason our attribute partitioning system 
evaluates queries is that the method of a query it required iflfe .mim--lm estimate the cost of 
answering the query. The query evaluator our system supplies the method of the query to the 
file cost estimator, which computes the cost of locating the s e lected tubtuplBS (according to the 
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method) and the cost of retrieving the subtupies needed for projection. (The attribute 
partitioning heuristics require the cost of answering all the queries in the database usage 
pattern. See Section 4.4 for a detailed discussion on how the file cost estimator estimates the 
cost of answering a query.) 

2- Query resolution is the process of locating the set of tuples that satisfy the 
selection component of the query. A query is resolved when all the selection attributes are 
resolved and a list containing the TIDs of all satisfying tuples is produced. After a query is 
evaluated, the query is resolved by accessing the indices specified In the query's method and 
performing the link to the subfiles In the order specified in the query's method. In each step 
of the method, the access path specified in the step ii actually performed and a TIP ..Hat is 
created of subtupies that satisfy the equality condition predicate of the attributes that are to 
be resolved in that step. For a conjunctive query, this TID list is intersected with the (old) 
TID list that is the result of the preceding steps of the resolution process. If the query is a 
disjunction, the union of the new TID list U taken with the old TID list The final TID list 
obtained from the last step of the method is the result of the query resolution phase. In the 
process of query resolution, page accesses are made to secondary storage when performing an 
access path. 

3- A query is answered when tit subtupies containing projection attributes that are 
pointed by the TID list are retrieved into primary memory and the attribute values of 
attributes specified in the projection component of the query are extracted and returned. 
This phase of query processing involves only input/output operations and no internal 
processing. As previously mentioned, if the hut subfile that is linked in the resolution phase 



Chapter 3 -&- 1Hr«%dtel of the DBMS 



contains projection attrfomigs, then tt is posKibte to «tt»r «w » w«f^ tb« q ue ry before the query 
is completely resolved by extnH^tWf tr^ p i ^ c^e c th w i ^WfthMte v»hm from a subtuple when the 
selection attribute vafcies c* me i&htu^ the query 

resolution and the qiwry atWWing phmswmay overtop on meto* subftte fn the method. 

In the rest of this section we wW discuss the qtiery evaluation strategy the query 
evaluator of our attribute pmrtittoninf system uses, l^iding a satisfactory method for a query 
in a partitioned environment is an involved Oak. Unlike query evaluation in an 
unpartitioned environment where the query evaluator has e«iy to choose the optimal set of 
applicable indices, a query evateater in a partitionedT*nv iro iwne nt in addition has to choose 
the sequence of linking to the subfiles. Our query evetoator is a heuristic evaluator that 
finds a satisfactory method for the query without resortaag to cost estimation. The method 
obtained by the query evaluator fa not ncosssarth; the a yUn i al method tor that query, 
although <in the course of our work) we have ^towNMl *> t&imMQltm&^Vif* ■*&■■■&& 
discuss the query evaluation strategy for conjunctive- queries. Tbei w afUr we discuss in what 
way the strategy used for disjunctive queries is different 

Query evaluation ti&mifa > v<Mm stages. -**i me first stage, the query evaluator selects 
the subset of applicable indices to include m the method. After this has been determined, the 
query evaluator has to chooser sea^ the 

rest of the selection attributes. (v 



I- Depending on the attributes in the selection component, their selectivities, and the 
attribute partition, it may be beneficial to use none, all, or a subset of the applicable indices. 
We believe that for most queries, using either none or all of the applicable indices will lead to 



Chapters -59- The Mode) of the DBMS 



satisfactory methods. Also in order to reduce the problem of query evaluation to a 
manageable site, we will restrict our attention to. the above two choices. 

One criterion by which we may Judge the effectiveness of utilizing the indices to 
process a query is the Joint selectivity of the indexed attributes that occur in the selection 
component of the query. Assume that: I- the indexed attributes are not jointly selective (i.e., 
the joint resolving power of the indices is low and t large fraction of the tuples will be 
selected so that almost all the pages of a subfile that is linked thereafter have to be retrieved), 
and assume that 2- a subfile that contains an Indexed attribute also contains some other 
unindexed selection attributes. Then such a subfile wilt mott likely be accessed in its entirety 
in order to resolve the unindexed attribute. Therefore, the indexed attribute in the subfile 
can be resolved by the link at the same time the unindexed attribute is being resolved and 
with no extra cost. Hence when the indexed attributes are not Jointly selective, using the 
indices will not save in the number of pages accessed. 

Thus, when the Joint selectivity of the indexed attributes is not too low (which is the 
case for the great majority of queries), the query evafuator will choose to use the full set of 
applicable indices. This is because the cost of resolving an attribute utilizing an index (if 
available) on the attribute is usually a fraction of the cost of resolving that attribute by 
linking to (or sequentially searching) the subfile containing it. This can be true even If the 
subfile containing the indexed attribute contains other unindexed selection attributes and has 
to be eventually linked: If the indexed attribute and the unindexed selection attributes 
residing in the same subfile as the indexed attribute are resolved simultaneously by linking 
from a TID list to their subfile, there may be more pages accessed than when the indexed 
attribute is resolved first using the index, the TID list pruned and reduced (as the result of 
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the indexing), and then the suWikfttnked to rw*»V# the omndeaed selection attributes. 

Whether all the applicable Indices '•*r# J uied or none of the applicable indices arc 
used, the query evaluate* will baw-n*'tl l ie*se*a !Aiif|M«^'1^' f ltf»idR%' : ^a : kil»files containing 

unresolved -aelectlow:»ai**rtit^ 

2- The sectmd suge of query evahMition begin* when the fcwHces that are to be used 
have been chosen. The query evahiator wilt then ha^ to link to llie subfiles containing the 
unresolved selection attributes starUng from the TID list that is the resuk of the indexing. 
Eveiytime a subfile containing *n unresolved setaptie* attribute is linked, the TID list is 
reduced to a T1D list of tuples that saUsfy trw sekctiton attrfetrtat^in the sub fi te to addition to 
the previously resolved attributes The subfiles containing unresolved selection attributes are 
linked in succession, producing successively mow refined TID lists. When a* the subfiles 
have been linked, the query is resolved and the TID h*t poinfc to Mw selected subtuples. The 
task of the query evahiator in this sUge of query evaluation is to find the optimal sequence of 
linking to subfiles. Note that the query evahiator does riot actually f>erfbm the wnkmg. The 
query e valuator only decides on the sequence of hnking te the subfiles. R lithe query 
resolver that actually performs the linking (in the sequence decided by the query evakiator) 
and retrieves the subtuples from the subfiles. The query evahNxor iMf need to know the 
expected cost of linking to subfiles when deciding of* the sequence. An estimate of the 
expected cost of linking to subfiles can be obtained without actually per forming the linking. 
In Chapter i we describe the. function used for this co« t*bT»tton. This function translates 
the number of tuple retrievals into the number of page retrievals and onb/ requtres the size of 
the TID list from which the linking is performed. The size of the TID list is readily 
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available as the product of the Joint selectivity of the attrikmt** resolved so far and the total 
number of subtuples in the subfile (the Joint selectivity of a set of attributes is obtained by 
expression 3.6.1 from the individual attribute selectivity). Also note that if no index is 
chosen in the first stage of query evahitttan, the first subfile of the sequence is sequentially 
searched (which is tantamount to linking tq the subfile from a TID list containing all the 
T IDs of the subtuples in the subfile). 

The criterion for optimization in this stage of query evaluation is the minimization 
of the total number of page accesses when answering the query. Depending on the sequence 
chosen in this stage, the method of | query may be optimal or highly nonoptimal. Therefore 
it is important that the query evakiator use a querj evaluation strateg? which guarantrej that 
the sequence chosen is close to optimal for most of the querlc* evaluated. As we mentioned 
before, exhaustive enumeration of all K! possible subfile sequences (where K is the number 
of subfiles containing unresolved selection attributes) is oift qf the question because cost 
estimating all of the sequences J$ cornputaUonaHy tatracUble. Due to the large search space 
(of possible sequences) and the numerous parameters that have to be considered in choosing a 
sequence, finding the optimal subfile sequence \\ * difficult task> However, we may 
qualitatively arrive at desirable sequences by considering the following criteria when deciding 
on the subfile sequence. I- Subfiles that can have their selection attributes resolved without 
incurring too many page accesses should be linked prior to linking to subfiles that incur 
many page accesses. That is, at each step where a subfile Is to be linked, the query resolver 
should link to the subfile that results in the smallest number of page accesses. Equivatently, 
this means linking to the subfile with the largest blocking factor (number of tuples per page), 
since the subfile with the largest blocking factor wi|| result in the fewest pages accessed, (To 
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see this, we refer the reader to the discussion of thf page access function presented in Chapter 
4 and expression 4.2.l.b. In this expression, for fixed n and fixed r , Mnjbjr) monotonicaily 
decreases as b .increases.) 2- The subfile than makes the Joint selectivity of the resolved 
attributes become highest (most selective) should be United first T*at is, consider the joint 
selectivity of the unresolved selection attributes of each subfile and **lec| Ihe subfile with the 
highest joint selectivity (i.e. the subfile that reduce* the TID Dst the most) to be linked next. 
In this manner, the overall joint selectivity whii tend <o become high as early as possible, 
causing the TID list of satisfying subtuples to be reduced e and fewer page accesses to 
be incurred as the query revolver goes to the next step of the method. The above two criteria 
can be conflicting requirements. A subfile may have low blocking factor but high joint 
selectivity for unresolved selection attributes, white an subfile may have large blocking 

factor but low joint selectivity. .' 

Based upon the above criteria, we have developed rive query evaluation strategies 
(heuristics) for choosing the subfile sequence. Each strategy is 'baaed upon one of the above 
criteria or uses a function of both criteria to rank the subfiles in some sequential order. 
Needless to say, we do not expect that any single strategy would be able to find the optimal 
sequence for all queries made to a database which is partitioned VI any manner. However, 
we require that the sequence chosen by a good strategy never to be far from the optimal 
sequence. In order to compare the different strategies which we present, we have conducted a 
set of experiments on each of the strategies. In order to determine to what degree the 
determined strategies are optimal and to what extent they may serve the purpose of query 
evaluation, we have also applied the set of experiments to two other "control" strategies, and 
compared the results with the results of the five strategies. The five strategies considered are: 
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(a) Least Page Access (LP A) - In this strategy, the subfile that results In the least number of 
page accesses is linked. That is. when there are a number of subfiles containing 
unresolved attributes, the query evaluator chooses to link to the subfile that would result 
in the least number of page accesses. Th|s is in accordance with the first of the two 
ordering criteria' discussed above. Intuitively, Unking the first few subfiles will result in 
not too many page accesses, and as the subfiles that incur many page accesses are linked 
further on, the joint selectivity of the attributes resolved so far will be sufficiently high 
such that not too rnany page accesses will be made to resolve the remaining attributes. 
As mentioned above, this strategy amount* to sequencing the subfiles according to 
decreasing blocking factor. 

(b) Least Page Access by Pairs (LPAP) - In this strategy, the query evaluator looks at all 
ordered pairs of subfiles. For each pair, the query evatuater computes the cost of linking 
to the first subfile of the pair and adds to it the cost of subsequently Unking to the second 
subfile. The computed cost for all the pairs is compared and the query evaluator selects 
the pair with the least cost to be the next two subfiles that are linked in the method. 
Note that when the query evaluator computes the cost of each ordered pair of subfiles, 
the second subfile wiU be Unked from a subset Of the TID ttst from which the first subfile 
is linked. Thli is because after Unking to the first subfile, the TIDs of subtuptes that did 
not satisfy the selection attributes in the first subfile are pruned from the TID list. 
Thereafter, the query evaluator reapplies the LPAP strategy to the remaining subfiles to 
select the next two subfiles that are to be linked hi the method. The reapplication is 
repeated untH alt the subfiles have been sequenced. Everytime a pair of subfiles is 
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selected, the TI© |t* is reduced t© a snwtler TJ& li* of tuples that in addition satisfy the 
selection attributes of 0ie pair of «ibfito Jwt tl e ct a d . 

The LPAP strategy is rtrnnar to the LPA strategy to that the criterion for 
sequencing is the number e£f^ accesses, f hywawf > this at rat a gy te oks at two subfiles at 
a time and also considers the Joint selectivity of the unr es olved selection attribtttes of the 
first subfile in choosing the subfile pair. Thereto*, «W» strategy wiH always result in 
better methods compared to the methods chosen by the LPA strategy. Observe that if 
there are only two subfiles that haw to be sequenced 1m»£m second *tage of query 
eva luation, then this strategy witt find the < 



(c) Highest Subfile Selectivity (HSS) - In this strategy, subfiles are sequenced according to 
their resolving power: The subfile containing unresolved selection attributes with 
highest joint selectivity is chosen to be linked first, and the subfile with the second 
highest joint selectivity is linked second, etc. This is in accordance with the second 
ordering criterion discussed above. The idea here is to reduce the size of the TID list as 
fast as possible. 

(d) Highest Selectivity and Least Pages (HSLP) - 1* is desirable to order the subfiles both 
according to the joint selectivity of the sefectton a ttributes and according to the number 
of pages accessed when linking to them. The previous strategies chose one or the other 
as - the ordering criteria. This ..strategy wmfeine* the two criteria by ordering the subfiles 
according to the (increasing) product of the joint selectivity of the selection attributes and 
the number of page accesses incurred in Unking |o the subfile; the subfile with the least 
product is selected and the strategy is reapplied to the remaining subfiles. Everytime the 
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strategy is applied to the subfiles, the subfile with the least product is selected arid the 
number of subfiles that are to be sequenced is reduced by *ne. TWs ttrategy, is based 
upon the assumption that considering both criteria w»l result in a superior method 
compared to a method that U found using a single crttertpu Nsge that in this strategy, 
everytime a subfile is chosen, the TID list is reduced to reflect the resolution of the 
attributes in the newly chosen subfile &e. the Joint selectivity of attributes resolved so far 
is multiplied by the select! vities of selection attributes in the chosen subfile). Thereafter, 
when choosing among the remaining subfile*, the number of page accesses incurred in 
linking to a subfile is computed from this reduced TID list 

(e) Highest Selectivity and Least Pages by Pairs (HSLPP) - This strategy Is like the Highest 
Selectivity and Least Pages strategy except that all ordered pairs of subfiles are compared 
together. For each pair, the number of page accesses (computed in the same way as in 
the LP AP strategy) is multiplied by the Joint resolving power of all the selection 
attributes in the pair of subfiles. The pair with the smallest product is chosen. The 
strategy is then applied to the remaining subfiles. Compared to the HSLP strategy this 
strategy performs a search of depth two and hence will result In superior methods than 
those found by the HSLP strategy. 

We have conducted a number of experiments on the above five subfile sequencing 
strategies. The experiments varied over two different sets of query usage patterns, two 
partitions, three sets of attribute lengths, and three sets, of attribute selectivity The results 
given in the table below are the average for each strategy's performance. The two strategies 
Exhaust and Random are "control" strategies against which the other strategies are to be 
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compared The Exhaust s**a*£y finds the optimal sequence of tutrflles by exhaustively 
enumerating all sequences, and selecting the sequence which results in the least processing cost 
for the query. The Random strategy finds a sequence for the subfiles by randomly choosing 
one of the possible subfile sequences, in what amounts to a non-strategy. The first row of 
Table 1 is the ratio of the average page accesses to each strategy with respect to the page 
accesses of the Exhaust strategy. The second row is the ratio with respect to the Random 
strategy (for the same set of experiments). 

The performance of the Least Page Access by Pairs strategy was very close to the 
optimal performance. By the performance of a strategy K» 4nean the cost of answering the 
queries in the usage pattern when each query is evaluated according to the strategy. The 
Least Page Access strategy also compares favorably to the other strategies. The performance 
of the strategies that considered the joint selectivity were not as good as the LPAP strategy. 
Even the LP A strategy, which only considers the number of page accesses, performed better 



1.0 1.425 1.103 1.004 1.288 1.246 1.055 

O.701 1.0 0.773 0.704 0.903 0.875 0.740 

Table 1 The results of different query evaluation strategies. 
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than the HSLP strategy that considered both the page accesses and the selectivity. We 
attribute this partly to the fact that after the first subfile has been linked, the Joint selectivity 
of the resolved attributes has become high enough so that the second subfile incurs 
comparatively fewer page accesses than the first subfile. Thus it becomes Important that the 
first subfile incur as few page accesses as possible. 

The LPAP strategy is very close to optimal and may be considered as the choice 
for a query evaluate* tn a partitioned database env i ro nm e nt. However In our work, we have 
chosen the LP A strategy because of the following reasons. 1- The LP A strategy Is 
near-optimal. 2- The LPA strategy Is computationally efficient compared to alt the other 
strategies. Since the number : 'of pages accessed In Imkmg from a TID list to a subfile is 
inversely proportional to the subfile's blocking factor (again, refer to Section 12 and 
expression 4.2.l.b), a query evaluator based on the LPA strategy initially has to order all the 
subfiles of a partition according to decreasing blocting factor. For each partition, the subfiles 
of the partition need to be ordered only once. Thereafter when evaluating a query, the query 
evaluator sequences the subfiles that contain unresolved selection attributes in accordance 
. with the precompiled sequence based upon the subfile blocking factor. 

The figures of Table 1 are p e r form a n c e averages over different queries, partitions, 
attribute lengths, and attribute selectlvttles. Obviously, some strategies perform better than 
others for certain queries and partitions. It was observed that tn general, as the number of 
attributes in the selection components of queries increases, the performance of each strategy 
deteriorates with respect to the Exhaust strategy, with the strategies that consider only a 
single subfile (the LPA, HSS, and HSLP strategies) deteriorating the most. Also, it was 
observed that the larger the number of subfiles in the partition, the less optimal the 
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performance of the various strategies. 

The above discussion concerned conjunctive queries. For disjunctive queries, the 
query evaluation strategy u«*d U very »mjj*r to the strategy, wed tt* conjunctive queries. If 
the indexed selection attributes are highly selective, and if the subtle contiiiniag the indexed 
selection attributes also comain uairuiexed se4ecaoo attributes, then with great likelihood, this 
subfile will be searched ill iH» entirety and wU* i»dtos**M**ee w^ 
avoided. Otherwise, the fuU set of apphcahle 4nd*6» 4s used. ,.,f*T4i§ subfile containing 
unresolved selection attributes are then sequenced according to the LP A strategy (i*. 
according to decreasing blocking factor). For a disjunctive query, the Joint selectivity of the 
resolved attributes is. computed accofding to expmsten 3.62 fron* the individual attribute 
selectivities, 

A disjunctive query is resolved differently from a conjunctive query in the query 
resolution phase: When a ftO Ust is obtained bf Unking to a subfile, the union of the new 
TID list is taken with the old TID Hst The resulting TID, Mat is. then oainpJementrd to 
obtain a list of subtuple TIDs Jthat do not satisfy any of the atUib ot ei re so lv e d to far. This 
complemented TID list is used when linking to the next jubftte in the method. 
Complementing a TID list is accomplished by repeatedly generating subfile TIDs using 
expression 3.2.2 and checking to see that a generated TID does not occur in the TID list. 

After the query evaluation phase, the query's method is passed to the query resolver 
which actually produces the TID list of selected subtuptes. The TH? Hat., t» then used for 
linking to subfiles containing projection attributes. Depending on the transaction type, the 
query answerer does the following: 
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1- Query - The subfHes containing the projection attributes are linked from the TID list 
constructed at the query resolution phase. The selected subtuples are retrieved from the 
subfiles, and the values of projection attributes are extracted and returned. 

2- Update - The selected subtuples are retrieved from subfiles containing projection 
attributes (as for a query), all attribute values to be updated are updated (in primary 
memory), and the subtuples are written back in their previous location. An update 
incurs as many page accesses as a query in the resolution phase, and twice the number of 
page accesses in the answering phase. If any of the updated attributes are indexed, then 
the affected .indices are maintained as appropriate. 

3- Deletion - All co-subtuples of the selected tuples are retrieved, marked deleted, and 
written back in their previous locations. A deletion incurs the same cost as a query in the 
resolution phase, and twice the cost of retrieving aM co-subtuples (i.e. the entire tuple) of 
selected tuples in the answering phase. The affected indices are maintained as 
appropriate. An overflow garbage collection may ensue if there are too many deleted 
tuples in the file. 

An insertion is different from the other transacttons. Assuming that the unused 
tuples are uniformly scattered throughout the file, inserting r tuples in the file incurs 
twice the number of page accesses required for retrieving r uniformly distributed 
subtuples from each of the subfiles. This number is computed from the page access 
function of Chapter 4. If the unused tuple stott in the file have been exhausted, then the 
excessive inserted tuples are appended to the end of the file. In this case, the number of 
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page accesses incurred for each subfile will be the number of appended subtuples divided 
by the blocking factor of the subfile. Indices are maintained as appropriate. 

We note here that the optimal attribute partition is independent of index 
maintenance, and the cost of maintaining the indices is incurred regardless of the choice of 
partition. Also we have taken the two problems of index selection and attribute partitioning 
as separate, assuming that the set of Indexed attributes is fixed. Therefore, index 
maintenance cost will not enter our objective oatt function, and we may eliminate it from 
further consideration. 

6. Parameter AoauigUlon 

The Parameter Acquistter monitors the database ma nag e me n t system and collects 
statistics both on the usage pattern and on the response of the database management system 
to the queries. The statistics collected are used to forecast database and usage pattern 
parameters for the next time interval A time interval is the time span between two 
consecutive repartitioning points. The forecasted parameters wiM be used by the tile cost 
estimator and the attribute partitioning heuristics at the repartitlpiiing pomt marking the end 
of the time -interval. Monitoring the database man age m e nt system is a feat tune activity; it 
has to be performed while the database management system proc es ses transactions. For this 
reason, only those statistics that can be inexpe n si vely acquired should be col lecte d . Abo, the 
statistics collected must be succinct and require little storage for their preservation. The 
statistics collected for the purpose of attribute partitioning faM into lour general classes: 
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1- Database Usage Statistics - For each query made to the database, the type of th? query is 
stored in a table of query types. The type of a query is the set of attributes in the 
selection component and the set of attributes in the projection component and a flag 
indicating whether the query is conjunctive or disjunctive. Consequently, all queries with 
the same attributes (but with possibly distinct attribute values to the equality condition 
predicate) are clustered together in It* safik ellry of tn* table. (A query type may be 
encoded as a bit map for the sake of succinctness.) CNir assumption that the fraction of 
tuples satisfying an equality condition predicate depends only <on the selection attributes 
and not on the attribute values tn the selection component make* this clustering scheme 
possible. The number of queries that aW clustered in the query type It recorded along 
with the query type in the table entry. 

2- Average Relation Size and Average Blocking Factor - The number of tuples in each file 
is needed for the purpose of cost analysis. This statistic is continuously updated by the 
number of tuples inserted or deleted so that it reflects the instantaneous size of the file. 
The blocking factor of the file (the number of tuples per page) Is also required for cost 
analysis. The blocking factor at a certain point in the time interval is the number of 
tuples in the file divided by the number of pages in the file at that point in the time 
interval. The number of pages in the file is also updated continuously as pages are 
allocated for inserted tuples or as pages are released after garbage collection, so that it 
reflects the true state of the database. Since tuples WIN be continuously inserted and 
deleted, while some tuples will be temporarily unused (until a tuple Is inserted in place of 
a deleted tuple or until the next overflow garbage collection occurs), si fixed value for the 
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blocking factor over the time Interval will at best reflect an average of the true blocking 
factor. The average blocking factor parameter to obtained by averaging the blocking 
factors observed at a number of points in the time interval. 

3- Attribute Selectivity Statistics - This statistic is the fraction of tuples thajt have 
historically satisfied an equality condition predicate on the attribute. To compute the 
selectivity of an attribute, the parameter acquisfcor records the number of times the 
attribute occurs in equality condition predicates of queries, and Car each such query, the 
parameter acquis! tor records the fraction (or an approximation thereof) of tuples that 
satisfied the equality condition. The average of these fractions is thus the attribute 
selectivity measure. Below, we describe how the fraction of se le cte d tuples is deter mined. 

Let «Tj, be the fraction of tuples that satisfy an equality condition predicate 
involving the ith attribute and occuring in the Jth query . the attribute will be resolved 
by either sequential searching, indexing, or linking. If the attribute is resolved by 
sequential searching (i.e. the subfile containing the attribute Is searched in Its entirety), 
then <r, ( can be precisely calculated as the ratio of the number of tuples satisfying the 
equality condition predicate n, to the total number of tuples. If the attribute is resolved 
by indexing, then a T1D list will be obtained that points to the selected tuples, and * H is 
precisely the ratio of the site of the TID list to n. ff "the 'attribute is resolved by linking, 
then e-jj has to be calculated in a reduced tuple space and then extrapolated to the entire 
space. This is because linking b performed mm a reduced set of tuple TIDs, which 
have been identified beforehand, in order to get a further reduced set of tuples that 
additionally satisfy the predicate. Depending on whether query j is conjunctive or 
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disjunctive, the estimation of e it will be done as follows. 

(a) Suppose the equality condition appears in a conjunction of L equality conditions of 
the form: 

C, A Cj a _ a Ct 

where C, is an equality condition Invoking attribute •-, . (The order of the equality 
conditions above reflects the order the predicates are sequenced in the query's method.) 
Let no be the total number of tuples in the relation, and let n { be the number of tuples 
that satisfy C, a c 2 a ...... a C, . (Note that these numbers are readily available from the 

query processor when it resolves the transaction.) <% for query j can then be 
approximated as: 

(ta) Suppose the equality condition appears in a disjunction of L equality conditions of 
the form: 



C| v Cj v — v CJ, 

where C t is an equality condition involving attribute •( . (The order of the equality 
conditions above reflects the order the predicates art sequenced in the query's method.) 
Let no be the total number of tuples in the re lattoa, and let n, be the number of tuples 
that satisfy -<5|A-ft a _ a-^jaCj . (Again, these numbers are readily available 
from the query prpcetsor.) The fraction of tuples satisfying £» of query j can then be 
approximated as: 
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The attribute selectivity s, for attribute «, may new be .cpinnuJted la* the average of 
^ forallHQ : . f -^ 

1 . • 

tot 

where Q is the set of queries made to the database during the previous time interval. 
By averaging the fraction of tuples satisfying the actual occurences of an attribute in the 
queries, we have taken into consideration both skewness in the distribution of attribute 
values (for the attribute) in the file as well as skewness In die distribution of attribute 
value occurences in queries. 

The selectivity of an attribute should change if either the distribution of its values in 
the file changes, or if the values in the equality condition predicates of queries involving 
the attribute change. Since the above changes occur when tuples are inserted, deleted, or 
updated and also as the database usage pattern evolves, the attribute selectivity measures 
need to be continuously updated to reflect the recent and more accurate information. 
The attribute selectivity measure is kept up to date by ma i nta ini ng a running average of 
each attribute selectivity, as the fraction of topics satisfying an equality condition 
predicate on the attribute fr calculated in the pfdeess of query resolution. Every time a 
search is done on an attribute of a Ote (or subfile), the attribute's selectivity ft updated by 
the weighted average of theoW selectivity and the fraction of the tuples selected in the 
search. 

After the individual attribute selectivities have been obtained, the Joint conjunctive 
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or disjunctive attribute selectivlties may be computed from them.. Our assumption of 
independence among attribute value occurences in tuples leads us to simple formulas for 
the joint selectivities. The expected fraction of tuples that satisfy a conjunction of 
equality conditions simultaneously is equal to the product of the individual expected 
fractions that satisfy each equality condition. The joint conjunctive selectivity of a set of 
attributes I , each with selectivity «| is: 

(3.6.1) n u| s. 

Similarly, the expected fraction of tuples that satlsy a disjunction of equality conditions 
simultaneously is the complement of the fraction expected not to satisfy any of the 
equality conditions in the disjunction. The joint disjunctive selectivity for a set of 
attributes I , each with selectivity s ( is: 

(36.2) i-n l€l u-s,) 

4- The last statistical information needed is the performance cost of the partitioned database 
in the current time interval. This is the cost (in terms of the number of page accesses) 
incurred when the database management system answers all the queries in the usage 
pattern. This statistic is the sum total of the number of page accesses made in answering 
queries since the last repartitionlng point. The parameter acquisitor updates this figure 
everytime a query is made to the database. This statistic is used to determine the extent 
to which the partitioned database performance cost comes close to the performance cost 
that had been estimated at the previous repartitioning point. If the partitioned database 
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performance cost it not within a reasonable distance from the p e rforma nce cost that was 
forecasted for it, then it may be concluded that Hie current usage pattern no longer 
reflects the forecasted usage pattern and that dw currant attribute partition is no longer 
suitable. If the database performance is diagnosed as such, then a repartitioning of the 
database may be initiated. 

When repartitioning Is initiated at a repartitioning point, the parameter acquisitor 
takes the statistics collected in the time interval since the test repartitioning point (and also 
the statistics collected during previous time intervals) and forecasts parameters for the time 
interval up to the next repartitioning point Specifically the parameter acquisitor forecasts 
the following parameters. 

(a) The frequency of occurence of each query type. 

(b) The size of the relation and the average blocking factor. 

(c) The average selectivity of each attribute. 

A thorough discussion of the exponential smoothing forecasting technique that 
should be used for the purpose of predicting the above set of p a r am e t e r s appears in [153 and 
[T} ; we will only give an ourhne here. Intuitively, exponential smoothing uses a weighted 
moving average that is based On two sources of evidence: the most recent observation and 
the forecast made previously. The new forecast is equal to a percentage (known as the 
smoothing constant) of the recent observation plus the complement percentage of the previous 
forecast. Exponential smo oth ing has a number of advantages roduding simplicity of 
computation, minimal storage requirements, adjustability for responsiveness, and 
genera lizabiiity to account for trends. A variant of exponential smoothing known as adaptive 
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forecasting may also be used. This technique takes trends in the parameters into account. It 
is an efficient technique and more reliable than exponential smoothing, and may be preferred 
to simple exponential smoothing hi some cases. For a thorough discussion of the different 
forecasting techniques, the reader is referred to the two wprks mentioned above. 

7. Repartltionlng Points 

Database repartitioning points may be determined m several ways. Repartltionlng 
may either be initiated by the database administrator whenever the database administrator 
deems necessary, or may be initiated by the parameter acquisitor. One way to have the 
parameter acquis! tor itself initiate repartitioning is to require It to prepare at each 
repartitioning point a forecast of the usage pattern and the database parameters for a 
number of periodic checkpoints into the future. For each checkpoint, the performance cost of 
the partitioned database is forecasted. During the course of monitoring the database 
management system and the performance of the partitioned database, whenever a checkpoint 
is reached, the parameter acquisitor compares the observed performance cost with the 
performance cost forecasted at the previous repartitioning point for that checkpoint. If the 
observed performance is inferior to the forecasted performance by a margin that is not 
acceptable, the parameter acquisitor may conclude that the current partition is no longer 
suitable for the current usage pattern and should then initiate repartitioning. When 
repartitioning is initiated, forecasts of the usage pattern and database parameters are 
prepared for a number of periodic checkpoints into the future. (Finding the optimal set or 
checkpoints is itself another database optimisation problem. We refer the reader to a brief 
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discussion of the problem of eptinwl determination of repartitioning points presented in 
Section 7.1 ) The attribute partitioning heuristics are then invoked to find a suitable partition 
that is optimal or near-optima} for *h* forecasted u»pe pattern. If the proposed partition is 
different from the current partition, the current attribute partition »> alio «ost estimated for 
the forecasted usage pattern. If the cost of the proposed partition it less than the cost of the 
current partition by a marf in that justifies repartitlonmg, the repartitioning of the database 
is carried out. 



;;*- 1 ;- .V>:. .v'-r^^i,*^ 



Chapter 4 -73- Cost Analysis and the FCE 



COST ANALYSIS AND THE FILB COST ESTIMATOR 



In this chapter we will analyze the cost of resolving and answering a query made to 
the database and describe how the fite cost estimator derives the total system performance cost 
for a given partition and a set of queries. (Henceforth, we wilt use the term partition 
performance cost to mean the performance cost of the database management system 
partitioned in a specified way, in response to the queries in the usage pattern. Also, we will 
use the term evaluating a partition to mean the derivation of the partition's performance cost 
by the file cost estimator.) Each of the attribute pardoning heuristics repeatedly calls upon 
the file cost estimator to evaluate partitions they propose. Thereafter, they select the partition 
with the best evaluation and based on it propose another set of partitions. Each proposed 
partition is cost evaluated by the file cost estimator and the partition with the best evaluation 
is selected. This process of deriving a set of partitions and selecting the best partition is then 
repeated. By this process, the heuristics try to propose partitions that result in successively 
better evaluations. So in a sense, the file cost estimator may be viewed as our objective cost 
function, which the heuristics proceed 10 minimize by proposing better and better partitions. 

We wilt assume throughout that internal processing costs (CPU- costs) are 
insignificant and the performance of the database management system we model is bounded 
by input/output operations (page read and writes) and hence that page accessing cost 
dominates all internal processing costs. Internal processing costs include the costs of query 
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evaluation (assumed to be negligible) ami of ootamirtg imer-secttoro and unions of TID fists 
in query resolution. We assume that forming trw iatanestkin a^^unioB<rf TH) Wsts can be 
entirely done in primary memory and so don not incur any page access to storage devices. 
Accessing the index of an attribute and retrtevittg the TID list of the index, do incur page 
accesses. Therefore, we include these cos» in computing the partition's performance cost. 

We do not consider data storage easts tit the partt$iaf& evetuatton. This is because 
if page breakage at the end of a subfile U %nored, the ainoum of fto^ 
attribute partition of a file Is the same. No matter how the file Is partitioned^ the storage area 
required for that file wilt be the same as that for the one-file paragon jphwi an., insignificant 
number of pages due to page breakage. When repartittonmg an attribute partition, for each 
subfile at most one page can remain unfitted; «fe»«hfj^ one 

partition to another cannot exceed the maxifnum number of subfiles in the two parUtions 
Since this figure is usually insignificant compared to the total number of pages reqvi red to 
store the data, we may safely ignore page breakage .a^.^hen||, f t|ora|^ ,co|U Ti^ 
consideration in the evaluation of a partition. 

Based upon the above assumptions, the petformar^ce cost .qf , ^partition will, be the 
cost of accessing the subfiles in order to answer the queries in the usage pattern. Since page 
access cost is proportional to the number of page accesses, our cost analysis will solely be 
concerned with the number of page accesses Incurred in answering a query- Before we 
discuss the file cost estimator, we give the page access analysis for each of the sequential 
search, linking, and indexing access paths. 
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L Sequential Search 

If a query's method does not specify any Index to be searched (this happens if none 
of the query's selection attributes are indexed, or if the query evatuator deems the indices 
useless for resolving the query), then the first subfile of the method has to be sequentially 
searched in its entirety (the rest of the subfiles will be searched using links). In a sequential 
search, all the pages of the subfile are retrieved, and their subtuples are matched against the 
attribute value specified in the query selection predicates, and the TIDs of qualifying 
subtuples are stored in a TID list Iff, Is the subfile that U being sequentially searched, n 
the number of tuples in the subfile, and b, the blocking factor for Fj, then the number of 
page accesses will be equal to the number of pages in the subfile: 

rn/b,1 

The blocking factor b, is equal to the system page size 8 divided^, by the tength of the 
subtuple. If A, is the set of attributes in subfile F,,and 1} the length of attribute a, then: 

b,-LS/2 I, J 

2. Tuple Retrieval Using Links 

Assume that we have a list of TIDs pointing to the subtuples of a subfile. We want 
to compute the number of page accesses incurred in retrieving the subtuples with TIDs in the 
list. (Such a TID list might have been obtained either by a sequential search on a subfile, by 
following a previous link to another subfile, by indexing on an attribute, or by forming the 
.intersection or union of TID lists obtained in any of the previous ways.) hi any case, an 
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estimate of the number of TIDs in the list (which is equal to the number of tuples to which 
they point) is readily available from the Joint selectivity of the attributes that have been 
resolved so far, whose resolution has resulted in this TID Hit. If t is the joint (conjunctive 
or disjunctive, depending on whether the query is conjunctive or disjunctive) selectivity of all 
attributes that have been considered in the creation of the TID Hit, and if n is the number 
of tuples in the subfile, then the length of the TID list is approximated by $ * n . 

Our cost criterion for performance optimization is the number of page accesses; In a 
paged memory environment in which tuples ire blocked together In pages, we have to 
translate the expected number of tuple accesses to th* expected number of page accesses. The 
expected number of pages to be accessed is always less than or equal to the number of tuples 
to be accessed because two or more tuples may reside on the same page In our model of the 
database management system, finding the expected number of page accesses is relatively easy 
because of the following properties, which hold as a result of the assumptions we have made 
about our file and Index models: 

1- The TID list is ordered. Whether the T© list to obtained by sequential searching, 
linking, or indexing tt is ordered (i.e. sorted m tncreasmf or decreasing value) and 
subsequent intersections and unions preserve thti ordering. '^^pl l oplirtf5of the TID 
list assures that each page of a subfile is retrieved at most once (since the tuples are 
retrieved in the sequence they reside in the subfile. This property also eliminates the 
need for large buffer areas in primary memory to a ccomodat e input/output operations, 
since at any instance, at most one page wiH be in primary memory.) 

2- The TID list is not redundant; i.e., no TID appears more than once in the list. 
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3- The TIDs are distributed uniformly over the TID space. This property! is assured by 
our assumption in Section 3.4 of independence among the occurence of attribute values 
in tuples. Hence all TIDs appear with equal probability In the TID list, and the tuples 
they point to*are scattered uniformly throughout the subfile. 

These three properties of our mode} of tuple access makes the translation of the 
number of tuple accesses to page accesses relatively simple. Based upon these three 
properties, Yue and Wong [40] have dertved the number of p»ge accesses from the number 
of tuple accesses in terms of the recurrence relation 44J4,;. 

(4.2.1.») Ainjbfl) ' 

n-r-b n 

(A2.1.M A(n,b,r*l) Mnjbf) ♦ — 

iw n-r 

In the above formula, A(njb,r) Is the expected number of. pages accessed from a file (subfile) 
with n tuples (subtuples) and b tuples (subtuples) per page when retrieving r tuples 
(subtuples). (Note that r-s*n in the cost analysis above.) The computation of 4.2.1 
involves on the order of r multiplications and r divisions, and is therefore quite expensive 
to compute. By the technique of generating functions, we have solved the recurrence relation 
4.2.1 and have obtained the closed form solution 4.2.2 for the number of page accesses. 
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r» 
(4.2.2) A<n,b,r) - - 1 - 

b . 



C) J 



The above formulation has the advantage over the recurrence relation that it can be 
computed more efficiently. A detailed derivation of the formulation 4.12 (hereafter called the 
page access function) may be found in Appendixl of (71 (Waters [36]and Yao [37] have also 
independently arrWed at Ini^page 'access lltiid^^lia^ffti hy p e rge o met ric distribution of 
probability theory.) The formulation 4.2.2 atio admits of a simple interpretation. For an 
arbitrary page in the file, the probability that it does net contain any of the r selected tuples 
is the number of ways of choosing b tuples from n - r tuples, divided by the number of 
ways of choosing b tuples from n tuples. Hence the expected number of page accesses will 
be the number of pages (n/b) times the complement of the above probability. 

During the course of attribute partitioning, the attribute partitioning heuristics 
repeatedly call upon the file cost estimator to evaluate partitions. Every time the file cost 
estimator evaluates a partition, it has to estimate the cost of answering each of the query types 
in the table of query types. Estimating the cost of answering etch $jery type involves 
computing the number of page accesses incurred in accessing each of the subfiles that 
contains an attribute in the selection or projection components of the query type. Since for 
each such subfile, we have to compute the page access function, it is important that the page 
access function be computed as efficiently as possible. The page access function 4.2.2, if 
expanded, will take on the order of mm(b,r) multiplications per computation. Although the 
page access function is much more efficient in computation than the recurrence relation 4.2.1 
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(since b is usually much smaller than r), computing the page access function in its exact 
form 4.2.2 is still too costly for our purposes. Instead, we use the following approximation to 
the page access function (suggested by Michael Hammer) in our file cost estimations: 



i[ 



x 
n-(b-l)/2 -\ 



(4.2.3) A(n,b/> P( -ll-f 

The approximation 4.2.3 has proven to be very fast, taking only a constint number of 
multiplications and divisions per computation, and has the advantage of extreme accuracy for 
almost every combination of n, b, and r M. 

3- Inttox Ace— Inp and TwU B^trte)v»l 

Using the index of an attribute of a file (or subfile), in order to retrieve tuples that 
have a given value for that attribute, is composed of three steps. The first step is accessing 
the non-leaf pages of the index to get a pointer to the Til) list of tuples with the given 
attribute value. The second step consists of retrieving this TID list. The third step is 
retrieving the tuples that the Tim point to by retrieving the pages they reside in the subfile. 
A detailed analysis of indexing costs appear* In Chan C7J and we will not reproduce it. We 
shall only repeat here the final expression derived In 01 The average cost of using an index 
is: 
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flog .. ■■ t<r» ♦ t/*)/u|Sn ♦ ,f •(•■•» ♦ UfaSl + Afcirh. **n) 

where n •> number of tuples in the file 

b -blocking factor of the indexed attribute'! subfile 

L - length of the indexed attribute 

s - selectivity of the indexed attribute 

u n - average fraction of index node page utilization 

u ( - average fractton ofindw leaf page utiHtatJon r " 

S - system page size. 
The three terms of the expression are the respective- cost* of the <hrot indexing steps* T#ie 
last step of index use, i.e. retrieving the qualifying tuples, actually occurs if this attribute is 
the only one whose index: is used in the m e Ctod of the q uery, {if at other cases, the 
intersection or the union of the TTD list obtained from the second step of indexing is taken 
with other f ID lists before the tuples that are pointed to are retrieved. 

4. File Cost Estimation 

The file cost estimator evakwtes • partition p^ 
■and computes the performance m&Jfm t*at partition. The performance cost of each 
proposed partition is estimated by rteratii^ over the o^e^ m the taWe ctf o^iery types and 
{estimating the cost of answering each query. (This table is provided by the parameter 
acquisitor and is a forecast of the database usage pattern for the next time Interval.) Each 
query type in the table is passed for evaluation to the query evajuator. The query evahiator 
uses the Least Page Access strategy and thereby produces a near optimal method for the 
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query, (The Least Page Access strategy, as described in Section S3, sequences the subfiles that 
are to be linked according to decreasing subfile blocking factor.) 

The file cost estimator receives the method for the query. If any index is specified to 
be accessed by the method, the file cost estimator uses the cost expression for index use to 
compute the cost of accessing the indices. If no Index is specified by the query's method, the 
first subfile in the method has to be sequentially searched and the cost of the search is the 
number of pages in the subfile. (The reason that the first subfile must be sequentially 
searched is that initially there is no TlD list on hand that would restrict the search to certain 
pages of the subfile- Sequential searching may be viewed as a limiting form of linking, where 
each page of the subfile has to be retrieved.) In either case, la. if indices are used or the 
subfile is sequentially searched, the Joint selectivity of the attributes resolved so far can be 
readily computed from expressions 3.6.1/3.6.2, depending on whether the query is conjunctive 
or disjunctive. (The set of attributes I in 3.6.1 and 3.6.2 is the set of attributes resolved so 
far.) The remaining subfiles of the method are sequenced and are to be linked in the 
sequence specified by the method. Using the approximation to the page access function, the 
file cost estimator computes the cost of accessing the first of these subfiles. (Observe that r, 
the number of tuples to be retrieved, equals, the product of the Joint selectivity of attributes 
resolved so far and the number of tuples in the subfile n). The access cost estimated for this 
subfile is then added to the cost of indexing/sequential searching. The file cost estimator 
then derives the new joint selectivity figure by including the oW joint selectivity figure and 
the selectivities of all the attributes resolved in this step of the method in expression 3.6.1/3,6.2. 
The cost of linking to the remaining subfiles of the method in the sequence specified is then 
computed for each successive subfile in the same way. The cost of accessing each subfile is 
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added to the accumulated cost, and the Joint selectivity is updated according to the 
selectivities of the newly resolved attributes. When no subfile remains hi the method, the file 
cost estimator wilt have computed the cost of resolving the query and also the joint selectivity 
of the query (and hence the number of tuples selected by the query). The file cost estimator 
then computes the cost of answering the query. Using the approximation to the page access 
function and the estimate of the number of tuples mat are selected by Hie query, the file cost 
estimator estimates the number of pages that need to be retrieved from each subfile that 
contains any of the projection attributes, the only subfile containing a projection attribute 
that does not incur page retrievals in the answering phase is the tost subfile in the method of 
the query. This is because the projection attributes in this subfile can be retrieved as the 
selection attributes of this subfile are being resolved. The cost accumulated in the resolving 
and answering phases is then summed to give the cost estimate for the query. A query's cost 
estimate is then multiplied by the frequency of the query type to get the total cost estimate for 
that query type. Finally, the sum of these weighted query cost estimates is the performance 
cost of the partition (in the context of the forecasted usage pattern). 

The file cost estimator is called repeatedly to the process of attribute partitioning, it 
is imperative that the file cost estimator be implemented efficiently. Note that by clustering 
all queries with the same type into one entry of the query type table, we have already reduced 
the totality of the queries in the usage pattern into a relatively smaller set of query types. 
Hence, the number of the iterations required by the file cost estimator has already been 
reduced. Although further clustering measures like the "nearest centroid" clustering scheme 
of Belford [5] could be employed to still reduce the number of query types, the degree of 
query clustering we have employed has proven sufficient for our purposes. Tests show that 
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the file cost estimator (as programmed in the programming; language |KDL [26] on a PDP-10) 
takes somewhat less than a second of processing time to estimate the cost of a set of 100 
conjunctive and disjunctive query types. Furthermore, when queries are additionally 
clustered, correlation information about attribute occurences in queries will be inevitably lost, 
and estimates based on clustered queries will be less reliable. Therefore further query 
clustering is not advisable. 

5. Repartitioning Coat 

The cost of repartltloning the attribute partition u computed as follows: if at a 
repartitioning point, the new partition has subfiles ¥ MtTV , ,F M In common with the old 
partition, then only subfiles F,, m J t F k have to be relieved, reorganized, and written back 
on secondary storage. The total page accesses required to do this will be twice the number of 
pages in each subfile: 

The above cost is based on repartitioning aH the subfiles F,, _ F k simultaneously. I.e., the 
pages of each subfile are read in sequence along with the pages of the other subfiles, the 
attribute values are then transferred from one subfile to another, and finally the pages are 
written back onto secondary storage. Each page of a subfile is thus accessed only twice, once 
for reading and once for writing. 

At each repartitioning point, the performance cost of the partition proposed by the 
partitioning heuristics for the next time interval has already been computed. The 
performance cost for the current file partition for the next time period Is then computed. The 
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two are compared and if the proposed partition offers a performance cost reduction greater 
than the cost of repartitioning, then the file should be reorganized according to the proposed 
partition. 
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CHAPTER 5 
THE ATTRIBUTE PARTITIONING HEURISTICS 



In this chapter we present a number of heuristics for partitioning the attributes of a 
file. Each attribute partitioning Heuristic starts with il supplied partition and derives from it 
a superior partition (If the heuristic is not able to improve on the supplied partition, the 
heuristic will terminate and return the supplied partition as its result.) Therefore it is 
possible to apply the attribute partitioning heuristics in succession, with each heuristic 
starting with the resultant partition of the preceding one and producing a partition that is as 
good as the preceding partition. We say that a heuristic is relevant to a partition if its 
application will result In an improved partition. 

We have performed a number of experiments on the attribute partitioning heuristics. 
The overall results of the experimentation performed on each heuristic is included in the 

....... ■ X 

discussion of that heuristic. Since our most extensive program of experimentation was 
applied to our main heuristics, we have devoted Section 1 of the chapter entirely to a detailed 
discussion of that subject. Before we proceed to describe the heuristics, we will first establish 
the necessary terminology for the subsequent sections. 

Let P be a partition of the set of attributes A of a file into disjoint subsets. Each 
subset of A is termed a block of attributes; the ith block of the partition is denoted by A f . 
A block of attributes may be viewed as a representation of a subfile, i.e„ when a file is 
partitioned according to a given partition P , each block A, of P is directly implemented by 
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a subfile with attributes drawn form A,. If M is the number of blocks in the partition, then 
P - {Ajtf.i . Aj n A, - # for all i t j , and u^ft, » A . The trivial partition P° has been 
defined previously to be the partition where every subfile contains exactly one attribute. 
That is, P° - {A?}?., /where A? - {•;} . 

L The ExKaust^vo gnusnoj^ofi ^ppffo^o^ 

One way of finding the optimal partition is to produce an partitions of the set of 
attributes, and evaluate each of them with the file cost estimator in order to identify the 
partition with the best performance cost. This exhaustive enumeration approach is not a 
viable partitioning strategy because of the large number of possible ways to partition a file. 
The number of distinct partitions of a set of m elements into disjoint subsets, B(m), is 
known as the mth Bell number. Unfortunately, there is no simple expression for B(m) that 
we can analyze in order to arrive at its complexity. However, Moser and Wyman [27] 
provide an asymptotic expansion for the Belt numbers. This asymptotic expansion is in 
terms of the solution to the equation xe"»m, and hence is not in closed form. From this 
asymptotic expansion it is possible to derive the following asymptotic upper and lower 
bounds for the Bell numbers [281 

(5.1.1) BOrO-eOn 1 ") 

(5.1.2) m (l - 6>m -©(B(ni» , €> 

where < is any non-zero positive real number. (The notation fun} 1 * ©(*(«*»» denotes that 
lim m»<o f(mtyg(m) - 0.) The two asymptotic bounds are verf tight, -afl# from them ww See that 
the number of partitions of a set of m elements into disjoint subsets asymptotically 
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approaches close to m* (or equivalent!?, close to 2* ^" ) *« w approaches infinity. By 
this we mean that < may be takes ,•* smaft a* possible a* long as it J» positive, and B(m> will 
always, grow faster than m u ~* )m . Therefore for aU p«Oical purposes, the number of 
distinct attribute partition* is prohibitively high to i-er^er an exhaustive enumeration 
approach feasible (for the general attribute rjartiUoriU^ problem with any number of 
attributes). As an example, a file containing W attributes can be partitioned into 
BOO) - 115975 different partitions, Arwther problem with the exhaustive enumeration 
approach is that generating. aU the Bt») different partitions is not an easy task. A program 
written for" generating all the partitions, of a set erf attribute* <and which was /used to 
exhaustively find the optimal partition for a number of attribute partitioning problems with 
not more than 8 attributes) required storage space that grew faster than «») • 

The heuristics we have considered in our work and described in subsequent sections 
are all stepwise minimization heuristics. Stepwise minimization is the process of carrying out 
an optimization task in a series of steps. At each step, a cost criterion is optimized to the 
extent possible. Each step that follows carries the optimization still further. Finally, when no 
further optimization can be performed at a step, the stepwise minimization process is 
terminated. In the case of the attribute partitioning heuristics, each heuristic starts from a 
predetermined partition, and in each step tries to come up with a new partition that is an 
improvement over the partition of the preceding step. By improvement we mean that the 
performance cost of the improved partition, as evaluated by the file cost estimator, is less than 
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the performance cost of the previous partition. Once an improved partition is found at a 
step, the next step starts with the rawly foand pattmoti and trie to flnda' stitl better 
partition. This process of Incremental improvement' to continued until no partition may be 
found which is an improvement to the partition considered in the last step of the heuristic. 
The last partition is then returned by the heuristic as me resultant partition of the heuristic 
The intermediate partition found at each step Of the attribute partitioning heuristics will 
depend on the partition of the hist step, the qiiery Fmjoerjdes, and the query types. 

At each step of the attribute partittonmf heuristics we have considered, the 
improved partition is obtained from the partition of the previous step by either I- grouping 
a number of blocks of the last partition together to form a single Mock, or by 2- degrouping 
a block of the previous partition into two or more btecki. The heuristics We have considered 
differ from one another Hi two respects: I- the attribute partition that they initially start 
with, and 2- the manner in which the blocks ate trooped or d o yi o bped in each step. 

In our work, we apply a heuristic to an initial partition until in the course of 
stepwise minimization, the heuristic produces a partition upon which it can no longer 
improve. At this point we may apply a second heuristic to the resultant partition of the first 
heuristic. After the application of the second heuristic, a third heuristic may be applied, or 
even the first heuristic may be reapplied. Since a heuristic always results *» a partition that 
is as good as the partition that it starts wkh, it » always possible to apply any number of 
heuristics in succession and m^er get a partition with a higher perfonnance cost (and 
occasionally get an improved partition). However, some of the heurtotia we consider are best 
succeeded by certain other heuristics. In the discussion of each heuristic, we will make it clear 
if the heuristic performed wetl enough to warrant further investigation, and if so, what other 
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heuristics were tried in combination with it 

Note that one mode of operation we do not consider is trying a heuristic for only one 
or a few steps and switching to another heuristic before the first heuristic produces its final 
resultant partition. Our mode of applying the attribute partitioning heuristics is based upon 
the assumption that if a second heuristic is relevant to an intermediate partition produced by 
a first heuristic (that is, the second heuristic can Improve upon the performance cost of the 
intermediate partition), then the second heuristic will stiff be relevant after the first heuristic 
has. terminated. This assumption is made in order to reduce to a manageable size the 
problem of deciding wWch heuristic to apply next. 

We shall consider a number of heurlstia in the forthcoming sections. However, the 
pairwise grouping heuristic described in the next section U the main heuristic of this Work 
and we will attempt to describe it in futt detail. In bur experimentation, we have found that 
the combination of the pairwise grouping heuristic with a second heuristic (the single 
attribute degrouping-regrouping heuristic) to be sufficient for the purpose of attribute 
partitioning within the context of tne database management system we have considered. 

& Tfre? ' Paji>Fr^ Qyoup^Bg „BtW!tjtt> 

The pairwise grouping heuristic begins with the trivial partition P° , and generates 
all partitions that can be obtained by grouping together pairs of blocks in P°. For example, 
if A m {\, 2, 3, 4} are the attributes of a file, the pairwise grouping heuristic begins with the 
trivial partition of row of Figure 1 and produces all the partitions of row I of the same 
figure. The heuristic then evaluates all the generated partitions with the file cost estimator. 
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and finds the partition (call R P 1 ) whose performance cost is the least of alt the generated 
partitions. In other words, assume C(P) to be the p e rfo r m ance cost Of partition P as 
determined by the file cost estimator. In the first step of the heuristic, the foUowtng 
minimization is performed: 

(5.3.1) min C(P|i,) 

where PJ k - {A?, .... A? u A .^., A°J . Let j and K be the v»hies that minimiie 5.3.J. |f it is 
the case that Q(P} k ) < C(P°) ., then the h^pro*e4partf^ ? ^ first step, 

and the second step of the heuristic begins with partition P 1 ^ f?j A . Otherwise, if it is the 
case that C<PJ k ) * CCP }. the heuristic terminates (with the trivial partition as the resultant 
partition). In general, the ith stfp of the patrwise grouping heurist^ aft/fa jtflh partition 

P M - {AJ-'.Ai 1 ,. A^} (where M^, i* the number of blocks in P*" 1 ), and performs the 

minimization: .,....;' -;,., 

where P\ k - {A[- v Af 1 u AJ-*, ..., A M -' } . Assuming j and k minimize 5.32, and if 

C<PJ k ) <C(P i - 1 ). the heuristic then goes to step i + 1 starting with F* -PJn, H - M-i-l- This 
process is continued until a step (say step L) ir reached for which Cf^ k ) i CCP^^br all j 
and k . At this point, no pair of blocks can be found that grouping them will reduce the 
performance cost, and so P 1 " 1 is returned as the result of the pairw^se grouping heuristic. 

The pairwise grouping heuristic may be depicted in terms of a lattice where each 
node of the lattice is a partition. The top node is the trivial partition and the bottom node is 
the "one-file" partition. An interior node is obtained by grouping together a pair of blocks of 
one of its parents. Figure 1 shows such a lattice for the set of four attributes {1, 2, 3, 4} 
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(from here on we shall use integers to represent attributes). The ith row of the lattice 
corresponds to all the partitions that could be generated by the ith step of the pairwise 
grouping heuristic (equivatently, aH the possible partition* with which the i4th step may 
begin). The pairwise grouping heuristic begins with the trivial partition and produces all 
the partitions that can be reached by following an edge (i*^ all the partitions of the first 
row), it then selects the partition in that row with the best performance cost. From that 
partition, it follows all the edges leading downwards to its children nodes. For example if the 
second partition from the left is the best partition of raw 1, then in the next step the heuristic 
Row 
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Figure 1 The lattice of partitions. 
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would compare the first, fourth, and sixth partitions of row 2. The three partitions of the 
second row are alt, in a sense, desirable partitions in that they have been derived from a 
partition that has previously been proved superior. SpeciffcaBy, the heuristic assumes that 
the optimal partition is either among the three partitions or is somewhere below them in the 
lattice and can be reached by going down the edge from the best of the three partitions; The 
heuristic continues to go down the lattice until none of the partitions examined in a row 
reduce the performance cost. At this point, the current parent partition is returned as the 
resultant partition of the pairwise grouping heuristic 

The resultant partition of the pairwise grouping heuristic is not necessarily the 
optimal partition. Only a small subset of all partitions are actuaHy examined by this process. 
On the other hand, at each step, the heuristic does select the best partition among a set of 
partitions, whose common parent was itself selected as best of. a simitar set of partitions. 
Hence, the resultant partition is optimal among a subset of all the possMMe partitions and 
locally optimal among all the partitions of the btttee. The stepwise minimization nature of 
the pairwise grouping heuristic is apparent from the discussions above. At each step, we 
minimize the cost for a subset of partitions that have been selected on the basis of a similar 
minimization in the previous step. 

The motivation behind pairwise grouping is « follows: Initially, when all attributes 
are separated in the trivial partition, thesequerie* that request two attributes are answered 
with close to. minimum cost, white those queries requesting more than two attributes are very 
costly to answer because their attributes reside in different subfiles and hence on different 
pages. Subsequently, as blocks of attributes are grouped, queries requesting a small number 
of attributes become costlier to answer because accessing the attributes will bring in those 
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attributes that are not requested by the query but nevertheless reside in the same subfile as a 
requested attribute, while those queries requesting many attributes, of which all or some are in 
the same subfile, become less costly to answer. In the process of grouping blocks together, a 
point will be reached where the reduction in cost of answering those queries that are 
benefited by the grouping will not offset the increase in cost of answering those queries that 
become costlier due to the grouping. This point is a local minimum of the performance cost 
function. 

The Bond Energy Algorithm of McGormick et al. [24] is another stepwise 
minimization heuristic that may be used for the purpose of attribute partitioning. Hoffer and 
Severance [19] have used the Bond Energy Algorithm to group attributes into blocks based 
on the similarities of attribute occurences in queries (see Chapter 2 for a detailed discussion of 
how Hoffer and Severance 093 utilize the Bond Energy Algorithm for the purpose of 
attribute partitioning). We believe that our pairwise grouping heuristic, when compared to 
the Bond Energy Algorithm, has a nurnber of advantages which makes it more desirable as a 
partitioning vehicle. The Bond Energy Algorithm operates by permuting the columns of a 
matrix consisting of pairwise attribute access similarity measures in such a way that the 
columns of similar attributes fall close together. If we look at the matrix of pairwise access 
similarity measures after the algorithm has terminated, we will find that the attributes are 

. ' ■ '■■ ■'■*,'■-"-.■;- .- <;\.: Si :'--.' ." .-'■':'■'*'-: '..■■■■ 

I i ...... . 

ordered such that similar attributes are placed adjacent or nearly adjacent to one another. A 
disadvantage of this algorithm is that after this is accomplished, i.e. after such an ordering of 
the attributes is found, it is left to subjective Judgement to decide how to clump the attributes 
together to form blocks. The other disadvantage of this algorithm (and one which will be 
rvamtned in Section 4) is that the algorithm only looks at the similarity of access between 
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pairs or attributes (i.e. between pairs of blocks, each of one attribute} rather than among any 
number of blocks containing any number of attributes. 

Stepwise minimization is basically a hill climbing heuristic search technique which 
will not necessarily locate the- optimal solution. The solution achieved using this technique 
may be any of the local minima; . the closeness of the solution to the optimal partition will 
depend on the database parameters, the access paths of the filet, and the usage pattern 
parameters. However, in the course of our ex p e rim en tat ion with the pairwise grouping 
heuristic (to be described in full detail In Section 5.7), the pairwise grouping heuristic starting 
with the trivial partition has consistently resulted in either the optimal partition or in a 
near-optimal partition that differed insignificantly from the optimal partition. This has led 
us to believe that pairwise grouping is an attractive heuristic search technique for finding an 
adequate partition for the attribute partitioning problem. 

The process of pairwise grouping is actually the method of steepest descent of the 
hill climbing heuristic search technique. The coordinates of a point on the "hill" (which 
should be visualized as inverted, since the search is for finding the minimum point) are the 
partition and the performance cost of the partition as determined by the file cost estimator. 
The distance between two partitions is defined as the number of edges on the minimum path 
connecting the two partitions in the lattice of partitions. Pairwise grouping is the process of 
following the negative gradient from one point to an adjacent point with a distance of one 
(along the partition axis), beginning at the point of the trivial partition. Our conclusion 
from this program of experimentation has been that this "hill* is predominantly devoid of 
"bumps" (i.e. local minima or points where the gradient changes sign and all adjacent points 
to the "bump" have a larger performance cost). The few "bumps" that occur on the hill 
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happen to be very close to the top of the "hill" in such a way that these "bumps" are not 
significantly lower than the top of the "hill" itself (the global minima). In other words, it 
appears that the "hill" which is the performance cost of the partitioned database is rather 
"smooth" and the "bumps" that Indicate partitions that are locally minimum all occur near the 
highest point of the "hill". 

As we will fully report in Section 5,7, the performance of the pairwise grouping 
heuristic is extremely good, managing to improve the performance of the database 
management system by a factor of 5 with respect to the one-file partition (le., the 
unpartitioned file). The resultant partition of the pairwlse grouping heuristic was in all cases 
either optimal or near optimal. To determine the extent to which the partition produced by 
the pairwise grouping heuristic is optimal (by comparing it with the optimal partition), we 
exhaustively produced the optimal partition for cases where m was less than or equal to 8 
(B<8) - 4140). For cases where m was greater than 8, we either manually generated 
promising partitions, or applied other heuristics to the resultant partition of the pairwise 
grouping heuristic. In all cases, the resultant partition of the pairwise grouping heuristic was 
either the optimal partition, or was within a few percent of it In the few cases where the 
resultant partition did not coincide with the optimal partition, it was observed that some 
blocks in the resultant partition contained one or more spurious attributes that,, if removed 
from their blocks, improved the performance. These spurious attributes, it turned out. were 
inserted in their corresponding blocks rather early in the minimization process. That is, 
although a spurious attribute resulted in performance improvement early in the grouping 
process (in other words a spurious attribute was attractive to its block), later on, as more 
attributes were added to the same block, the spurious attribute was no longer attractive to the 
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augmented block, and degtvupmg it from its block or transferring it to other blocks became 
beneficial. In Section 6 we will describe other Stepwise minimization heuristics, some 
intended to eliminate this deficiency of 1 the palrwue grotipmg heuristic. 

The major disadvantage of the palrwise grouping heuristic is its slowness. Every 
partition that the heuristic produce* by grouping a pair of blocks need* to be cost evaluated 
by the file cost estimator Let us see, In the course of carrying oot the. mmimt ration 53.2. how 
many partitions need to be cost estimated, in the analysts below, we have assumed that if the 
f i le has m attributes, Mien oh the average the heuristic will iterate for f mft 1 steps. This Is 
a reasonable assumption hi that the heuristic may iterate anywhere from I to m - 1 steps. 
(Our analysis w«l still hold as long as the number of steps is a constant fraction of m.) The 
improved partition produced at a step has exactly one less block than the partition of the 
previous step, i.e. Mj-M-,1, - 1, and the 4th step wW start with a partition that has 
Mj_i - m - i + l blocks. Under this assumption, the final partition produced by the heuristic 
has L m/2 J + 1 blocks: there ate fmfZ 1 - 1 steps that produce an improvement and a final 
step that produces no improvement. Therefore at the tth step, there wiH be 
binomial(m-i+l, 2) partitions generated that need to be cost estimated. On the average, the 
total number of file cost estimations will be: 



rm/21/m-l+l\ 

! w \ 2 ; 



The above expression is of order m 3 . This order is quite large, and it is desirable to reduce 
the partition search cost by doing either of the following two things: I- reduce the cost of file 
cost estimation by evaluating partitions in some other manner, or 2- reduce the number of 
file cost estimations by cost estimating only some of the partitions generated at each step of 
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the heuristic. The next section wilt investigate the pouibltty of cost reduction by employing 
means other than file cost estimation. Section 5 wttt investigate the possibility of reducing the 
number of partitions that need to be cost estimated. 

4. Filo Cost Jfil^^p ap ft MwHWrf 9f Pl<?pk Attrt>9tiyity 

Assume we have the partition P - {A ( ,__, Ay} . Define the (n-wise) block attractivity 
measure u of {A (| ,..„, A, } c p to be the cost reduction (or increase) obtained by grouping 
the n blocks together: 

^({A^ ,...., A, b H P) - «P) - «{Ai ,«, A,.u.-uA, ,_, A M }) 

If °4j\ ••-•» \)> p > '» positive, then partition P may be improved by grouping A Jf , A^ . 

If oi({Aj ( Aj^}} P) is negative, then grouping {Aj ,^., Ai } will increase the performance 

cost. The pairwise grouping heuristic cost estimated all partitions P, h generated by 
grouping together blocks A, and A), of P in the ith step, and selected the partition with 
minimum cost. Since C(P) was available from the previous step, cost estimating all P, h 
where 1 s j < K s Mj_, is equivalent to finding all the block fttractivities <*({Aj , A*}* P) , and 
minimizing C(P,„) is equivalent to finding the maximum value for oc<{A, , A k } ; P) where 
lsj<Ks Mj_| . The two blocks Aj and' A* that maximize «* are then grouped in going 
from the current step to the next step. (They -are grouped onjy if «<({A|, A*}! P) > 0.) From 
this discussion, we see that a step of pairwise grouping may be viewed as the process of 
finding the pairs of blocks in a partition with maximum attractivity. . 

In the previous section we noted that one way the pairwise grouping heuristic may 
be made more efficient is by refraining from file cost estimation. That is, by computing the 
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block attractiv ity measure for all pairs of Mock* without resort to file cost estimation. 

In order to avoid computing «c , we may try to find an approximation to it that is 
easily computable. An approximate block attractivity measure must possess certain qualities 
of the (ideal) block attractivity measure «e tn order to guarantee that the outcome of the. 
pairwise grouping heuristic using the approximate miftitire is th« same is the outcome using 
the (ideal) block attractivity measure (the reason we call u the Meal measure is that no other 
measure can be better than true file cost estimation). Below we give a number of properties 
that a candidate measure for the approximation to the block attractivity measure must 
necessarily possess 

I- The candidate measure should depend not only on the set of blocks m the partition that 
their attractivity is being cornpoted, Inrt also on how the rest of the attributes are 
blocked in the partition. That is, the same set of blocks may have a different 
attractivity depending on the partition they are in Because- 4 ©* this, we just cannot 
compute the attractivity measure of a pair of blocks by considering only the pair of 
blocks and their attributes. Rather, attractivities must be computed m the context of the 
partition as a whole. Consequently, the attractivity measure of a pair of blocks in a step 
of the pairwise grouping heuristic depends aho on Ate partition reached at that step. (If 
this were not the case and the candidate measure depended only on t.he pair of blocks, 
then it will become possible to redistribute the attributes m the rest of the blocks such 
that the ideal measure changes and becomes of a different sign signifying that the pair 
of blocks have become attractive or unattractive, while the candidate measure remains' 
unchanged.) It is for this reason that the attractivities computed in a step of the 
pairwise grouping heuristic cannot be used in future steps. 



Chapter 5 - 99 - The Attribute Partitioning Heuristics 



2- The candidate measure must be computed by considering all queries made to the file. 

3- The candidate measure should be applicable to blocks with any number of attributes in 
them, rathe* than just being applicable to btocks containing a single attribute. 

4- It must be possible to compute the attractivity of any number of blocks; i.e.. the 
candidate measure must be an n-wise attractivity measure. We need this capability 
because some of the other heuristics we shall consider require the mutual attractivity of 
more than two blocks. One might assume that the attractivity of a set of more than two 
blocks can be obtained by applying the pairwise measure to alt pairs of block in the set, 
and making the n-wise measure the sunt total of the pairwise measures. This is not true 
because the pairwise attractivity measure is not transitive iFor example, if blocks A t 
and A 2 are attractive, and likewise blocks A* and A» we cannot conclude that It is 
desirable to block A|,A 2 , and A 3 into the same block, because blocks A t and A 3 may 
be unattractive to each other. Even if it were the case that alt pairs of A, , A 2 , and 
A 3 were attractive, there is no reason to conclude that all three blocks are mutually 
attractive. Hence, an approximate measure must also be an n-wise block attractivity 
measure. (Note that the ideal block attractivity measure is an n-wise measure.) 

We have considered the attribute similarity measure of Hoffer and Severance [19] as 
a candidate for the approximate block attractivity measure. The pairwise attribute similarity 
measure of two attributes (or equivalent^ two blocks, each containing one attribute) is the 
ratio of the information transferred (from secondary storage to primary memory, for the 
purpose of selecting tuples that satisfy a predicate or for the purpose of projecting a value), to 
the total information transferred from a subfile consisting solely of the two attributes (when 
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answering queries that request one or both of the attributes). In other words, assume that the 
two attributes exist by themselves in one subfile When a query that request one or both of 
the attributes is made to the database, a portion ft* some queries alt) of the subfile's 
subtuples will be transferred to primary memory (depending on the Joint selectivity of the 
other attributes of the query that have already been resolved), and onto; a certain percentage 
of this portion will actually be useful data (because the first of the two attributes may further 
restrict the set of tuples to the tuples that satisfy a predicate on the attribute). This 
percentage is the ratio discussed above. There is one such ratio few each query type The 
weighted average of all such ratios will be the pairwise attribute similarity measure of the two 
attributes (the weight of a query type is its frequency). From this definition we see that the 
pairwise attribute similarity measure applies to onto; ffctrs of bteckf, wliew each Mock consists 
of only one attribute. (Hoffer |4] describes an extension of the attribute similarity measure 
that works for more than two blocks, but where each block still consists of one attribute only.) 
The. other problem with the attribute access similarity measure is ; that it does not depend on 
the other blocks in the partition; it just depends on the two attributes of the measure. For the 
above reasons, we have concluded that the attribute access similarity measure of Hoffer and 
Severance cannot be used as an approximate measure of the block attractivities. 

Considering the above prerequisites for an approximate measure, and the fact that 
the file cost estimator implemented turned out to be rather fast in estimating the performance 
cost of a partition (the speed of the file cost estimator U partly due to the rather simple query 
evaluation strategy that ijt uses), we decided to retain the ideal measure (i.e. the difference 
between the performance costs of the oW and new partitions) for the purpose of finding the 
most attractive pair of blocks in the pairwise grouping heuristic Using the ideal block 
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attractivity measure, we hope to bring out the real characteristics of the attribute partitioning 
problem, characteristia that might be blurred and obscured using a less than ideal measure. 
Indeed, the choice of the file cost estimator as the attractivity measure has proven useful tp us 
and has allowed us to considerably reduce the nurnoer of pairwise attractivity measures that 
need to be computed. The next section elaborates on how this reduction is achieved. 

— T1 * e y * st P*lrw*fff ftrowptog HmrlfUP 

In the previous section it was pointed out that the attractivity of a set of blocks in a 
partition is also dependant. on the other blocks '"of the partition However, in the course of 
our experimentation, it was observed that if blocks A, and A 2 were attractive to one 
another at a step of the pairwise grouping heuristic then > t , and A* remained attractive 
with almost the same attractivity measure at the next step, with one possible exception. The 
exception is when either of blocks A, or A 2 were gnxiped (together or with another block) 
in going from one step to the next step. In terms of the attractivity measure, we observed 
that: 

< 551 > *«A r .A v k.P> M oeffA^AyhPn,) 

or equivalently that: 

C<P) - C(P^) * C(P Jk ) - C(P |My ) for all x ^ j, k 

and y i* j, k 

where P - {A,}?.,, P ik - {A,, ,., A,u A* ^ *y.ffc- P^ - ^ U A„ Ay, .., A M _,}, and 
PjMy - f A t» •••» Aj u Ak, „., A, u Ay, _., Am. 2 }. It was ibo observed that the attractivity measures 
of pairs of blocks retained their relative order (in terms of magnitude) with respect to one 
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■request an attribute from A, u A* then d? ik ) - C(P) and CCPji^y) - C<P My ) and the 
query is nonvolatile. If the query does not request an attribute from A, u Ay, then 
C<P xy > - C(P) and C(P, My ) - C<P, k ) and the query is non-volatile. Another necessary 
condition for a query to be volatile is that grouping A, and A* should make a 
difference in the processing of the volatile query. That is, a volatile query must be 
processed with a different method in partition P (or P„y) compared to the method it is 
processed in partition P,„ (or Pji,^). If this were not the case and the query's method 
was the same for both P and P,,,, then the effect that grouping A, and Ay would 
have in the cost of processing the queij w)H be the same whether or not A, and A* are 
grouped, and hence will cancel one another out in 55.1. This will cause the query to 
become non-volatile. Therefore for a query to be volatile, it must not only request 
attributes from both A, u A* and A„ u Ay in iU selection component, but grouping A, 
and Ak should influence the method in which the query is processed. From this 
discussion we may conclude, that the queries in the database usage pattern are 
predominantly non-volatile. 

To reiterate, the observation 5.5.1 is because the queries in the usage pattern are 
predominantly non-volatile, and if a query happens to be volatile, it does not significantly 
determine the attractivity of A x and Ay. 

Observation 5.5.1 has a number of important implications. First, it says that the 
pairwise attractivity measure needs to be computed for alt pairs of blocks only once, and that 
the attractivity computed remains valid in subsequent steps as long as neither blocks of the 
pair participates in a grouping. Second, it says that the attract! vittes which need to be 
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the sorted list that contain either one or both of the blocks participating in the grouping. It 
then computes the palrwise attractivity measures between the newly created block and the rest 
of the blocks in the partition. The newly computed attractivity measures are then sorted and 
merged with the sorted list The most attractive pair of blocks is then selected from the top 
of the list and grouped for the current step The above process is then repeated on the new 
partition. 

Our experimentation with the fast palrwise grouping heuristic has been very 
successful. The fast palrwise grouping heuristic consequently produced the same final 
partition as the palrwise grouping heuristic produced in all the tests performed. 
Furthermore, it was drastically more efficient than the- palrwise grouping heuristic. The 
extent to which the fast palrwise grouping heuristic Is more efficient then the palrwise 
grouping heuristic increases as m increases and as the number of steps the heuristic iterates 
increases (the two heuristics iterate for the same number of steps since they produce the same 
final partition). In our experiments, the fast palrwise grouping heuristic was anywhere 
between 1.6 to 5 times more efficient than the palrwise grouping heuristic Because of its 
advantages, we have adopted the fast pairwise grouping heuristic instead of the palrwise 
grouping heuristic, as one of the two main heuristics of our attribute partitioning system. 

6. Other Variants of the Stepwise Minimization Heuristic 

In addition to the pairwise grouping heuristic described above, a number of other 
heuristics were developed and subseqUentH; programmed and tested. We witt describe the 
following heuristics: 
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1- the k-parttuon pairwtxfrwiptny t w umttc , 

2- the two most attractive pairwise grouping heoristfc, 

3- the trtplewise grouping heuristic, 

4- the bmarf d egtoupm g heortsac, 

5- the single attribute d roo ping heuristic, 

6- the double attribute deg ro upin g bewiSUi, 

7- the single attribute i 

8- the double attribute 

9- the single attribute m igj ws p ing heuristic. 

We will also give the overall molts of experiments OMdueM m s M ftwil envira wne nts , to 
that the cost effectiveness of each heorrttk <le. the opnraaf J u ) mmm i w al ta M partition when 
measured against the ^tort Teoorrefl to produce *)ms* it MWI Hewrtstlc I - 1 «nd « aw 
intended as alternatives <to the Ifatt) fwirwtte giouptag bawHstk , ei sla i hiuiUUts ft - « Midi 
degroup blocks hi each step) are intended to improvt a pun the resisfcant partition of the (fast) 
pairwise grouping heuristic. 

1- The k-partWon p»irwi« groopjiw, bearisttc - Thw lisiisism fc an extension of the 
pairwise grouping heuristic, m the fh-« step of this heartsttc, the % paitWom with the best 
performance costs are selected and pasted *> the next step. Each subsuaucnt step receives * 
pa rtitions and generates atl possible parttttoro that can be obtained *rnr* fhem %y pairwise 
grouping of blocks. Outof the sat of att^artitioi»geneaated«tha«Mt % as* selected and the 
process above is repeated. The pairwUe groeping hewatta: is* aaeuiil mi of the fcrparsition 
pairwise grouping heuristic where K- 1 . Thereto*, the t-fstrttuon pairwise grouping 
heuristic will always result in a partition that H at least as opttmal as the nsstHtant partition 
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of the pairwise grouping heuristic. The drawback of this heuristic is iu search cost, which, is 
roughly k times that of the pairwise grouping heuristic In view of the near optimally (and 
in most cases the optimality) of the pairwise grouping heuristic, it U not possible to justify the 
additional search cost that this heuristic incurs. 

2- The two most attractive pairwise grouping heuristic - We noted in Section 4 that 
block attractivities remain relatively unchanged from one step of the pairwise grouping 
heuristic to another if none of the blocks participate in a grouping. We also noted that the 
attractivity of pairs of blocks retained their relative order from one step to another. Hence, in 
each step of the pairwise grouping heuristic it is Hkeh; that the few most attractive pairs of 
blocks will be eventually grouped in subsequent steps, and therefore we can deduce that the 
two most attractive pairs of blocks in a step are good candidates for being grouped 
simultaneously in one step. 

Having this in mind, the two most attractive pairwise grouping heuristic was 
developed. This heuristic takes the top two most attractive pair* of blocks and groups both, 
pairs in one step. (If the top two most attractive pairs of blocks have a block in common, 
then all three blocks will be grouped in one step.) Each subsequent steps lakes the partition 
of the last step and simultaneously groups the two most attractive pairs of blocks. Grouping 
the two most attractive pairs of blocks has the advantage that the number of steps required to 
get the final partition is reduced by a factor of 2- However, this heuristic has the 
disadvantage that as the number of attractive pairs diminishes towards the end. grouping the 
top two pairs becomes too arbitrary a thing to do; If the first best pair of blocks and the 
second best pair of blocks have any block in common, then it is possible that the uncommon 
block in the second pair is not attractive to the new block formed by grouping the first pair 
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(as noted previously the pairwise atti-^vriy measure it not transitive). Another 
disadvantage with this heuristic is that even if the two top pairs of bkxks have no common 
blocks, it is possible that after grouping the first best pair, Mother block In the partition 
manifests a greater attractive to the newiy crusted M«* than «^ 

best pair of blocks. Indeed, teste show that the two moil attract!** pairwise grouping 
heuristic produces the same partitions in the first few steps a* does the pairwise grouping 
heuristic, but later on it divested and orodoced a less ev«ma4 partition. 

3- The tripftewlee grouping heuristic <■ TMthwriettc la «he third variation of the 
pairwise grouping heuristic we have considered. Thto hear***: g*oupc*4ples of blocks at 
each step instead of pairs of Mocks. In the pa irwis e giu a y tng heurfcufc, we generated at) the 
partitions by grouping pairs of block*. The laipkwM gio wa w a j, heuristic generates all 
partitions by grouping triples of btacka and thm aatas the partttto wtm best performance 
A nnmber of partitioning problems were solved using ffcfe heuristic In aH cases, the 
triplewise grouping heuristic produced a mmoptimftj partition, and in almost every case, the 
resultant partition was less optimal than the pairwise gioupt u g henrtatk paction 

The inferior perfoiwrnnee of this anuria* may he attributed to Che large number of 
blocks it tries to group in one step. Cturnping three blacks tmmfcaa w sms ty m one step is 
somewhat too coarse an action. For exanipk, in soaw experhneata we obaerved a set of three 
blocks A,, A 2 .atid% thai were ail p^ 

A , and A 2 were grouped together, .A3 was no fanger attractive to the Mack A, a A z (i*. 
<*<{ A , , A 2 } { P) > and .*({*„ A& « > and **&* *&n> *; bat eaflA,, A* A,)} P) * O). 
The triplewise grouping heuristic aho suffers framttke drawbacks of the two most attractive 
pairwke grouping heuristic mentioned above Fmafty the tripktwtse grouping heuristic U on 
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the order of m slower than the pairwise graipinf heuristic, in that it requires 
binomi«l(m-2 i+2, 3) attract! vities to be computed in step i. 

4- The binary degrouping heuristic • This heuristic resembles a logarithmic search. 
It starts with the one-file partition and tries to divide it into two blocks so that the 
performance cost improves. After this is accomplished, the heuristic tries to divide each of 
the blocks into a pair of smaller blocks and so on. Dividing a block is accomplished by 
finding all attributes that if excluded from die block, result In a reduction of the performance 
cost (i.e. finding aH the attributes in the block that are urtMtractive to it). At each step, a 
maximum of half of these unattractive attributes are degrpuped from the block into a 
separate block of their own. Although this heuristic U ve^|^ Jn that it is logarithmic, it 
performed quite poorly, and tended to produce jt frtv|al v partition at the end, On<? may 
speculate that this heuristic can be useful for a usage pattern where the optimal partition 
consists of 2 or 3 blocks. 

5- The single attribute degrouping heuristic - This heuristic along with all the 
remaining heuristics are degrouping heuristics. A degrouping heuristic takes a partition, and 
separates out one or more attributes from one block of the partition. In, terms of the lattice 
model of stepwise minimization, the degrouping heurlstiq amount to going up an edge, from 
a partition in one row of the lattice-to another partition in tte to* directly above. Except for 
heuristic 9, all the degrouping heuristics are intended to be, used to Improve upon the 
partition, that has already been arrived at by the process of pairwise grouping. By using the 
fast pairwise grouping heuristic and the degrouping heuristics alternately, we try to further 
optimize the resultant partition. When the pairwise grouping heuristic and the degrouping 
heuristics are applied alternately, the process of grouping and degrouping blocks may be seen 
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improvement. Again, experiments produce?! negative results and in none of the cases tried 
did it improve on the partition produced by the pairwise grouping heuristic. 

7- The single attribute degrouping-regroupmg heuristic - This heuristic eliminates 
the drawbacks of the previous degrouplng heuristics by degrouplng an attribute from its 
block and simultaneously grouping It with another block, aH in one step. This strategy is 
based on the reasoning that an attribute which has been grouped fairly early in the pairwise 
grouping process must have had a high attractlvity m the first place. It Is unlikely that this 
attribute should exist by itself in a separate block. More likefy, in the later steps of pairwise 
grouping, it becomes more attractive to other blocks than to its current block. Indeed, in the 
few cases where the pairwise grouping partition was not optimal (as compared to the optimal 
partition found by exhaustive enumeration), the optimal partition could have been obtained 
from the pairwise grouping partition by transfering one or more attributes from one block to 
another. In terms of the lattice of attribute partitions, this Amounts to a traversal from a 
partition of one row to another partition of the same row by going up an edge and coming 
back down another edge. 

The single attribute degrouping-regroupmg heuristic first selects the attributes that 
reside in blocks of two or more attributes. For such an attributes, it computes the attract! vity 
measure of the attribute with all the blocks m the partition except for the block that the 
attribute currently resides in. It then selects the attribute that Is most attractive to some other 
block and groups it in that block. If this improves the performance cost over the 
performance cost of the current partition, the heuristic Iterates on the improved partition. In 
the cases where the pairwise grouping partition did not coincide with the optimal partition, 
the -single attribute degrouping-regrouping heuristic when applied to the resultant partition 



Chapter 5 — M^ TfctAMrifenlftfMftifcteiiteff Heuristics 



of the pairwtse grouping heuristic resulted In the optimal partition. (Observe that if applied 
to the optimal partition, the single attribute d sgwwjna, l e giuupii s g heuristic will immediately 
terminate and return the optimal partition.) it it a*fte»* m aetemme «*e single attribute 
degrouping-regrouping heuristics search east A* m kmm*4m.to *l<a»n where the 
heuristic was applied, it did not reouire mow than three step* to oooctudft. ■ 4 

The fast pairwite grouping henrittto in awhination with Jhe single attribute 
degrouping-regrouping heuristic has proven superior to ail the other heuristics as a result of 
our investigations. Conseoooitth/, we have selected the ^st-patrwiif grouping heuristic in 
conjunction with the single attribute de gr o u ping - re gioupin g heu r ist ic as the ohfttoe of the 
partitioning heuristics for our attribute ptrtittoitiog system. 

8- The double attribute dfgnrH i plnsj i^ i uMph i g Mmisifc - thte I wojii li c is similar 
to the single attribute degrouping i e gioupi»K hewtl aptlh** attempts W oegroup a 

pair of attributes from the same block and group both baa* with some other block. 

This heuristic has the Advantage over the silUjfo attt i bum de giw o pine regroup ing heuristic 
that degrouping-regrouping any one of two attributes in one step mar not result in an 
improved partition, while degroup ing * o #oup«% « e*ev m one step may 

produce an improvement. In none of our ex p er im en t s did we run into such a situation where 
double attribute degrouping-regrouping produced an improvement white single attribute 
degrouping-regrouping did not produce any i m p rov e me nt Abo, eurittic suffers from 

being too coarse, as was the case with the triptewtse grouping heuristic. If 
degrouping-regrouping just a single attrib is beneficial, c will not detect that. 

Hence we have dropped this heuristic from further ceftstaeratton m favbr of the single 
attribute degrouping-regrouping heuristic. 



Chapter 5 -IIS- The Attribute Partitioning Heuristics 



9- The last heuristic that we developed and tested is the single attribute ungrouping 
heuristic This heuristic is a degrouping heuriitlc in that R attempts to degroup attributes 
from their blocks. The single attribute ungrouping, heuristic performs the functions of both 
the single attribute degrouping-regrouping heurUtk and tb| single attribute degrouping 
heuristic in one step, and it is thus »upertor to both of them: When an attribute i^ degrouped 
from a block in a step, it may be regrouped back with other blocks, or placed in it* own 
block; In terms of the lattice of attribute partitions, (kxtw of the) partitipn* reachable from 
the last partition by either an upward edge (to directly one row above, i.e. degrouplng). or all 
partitions that belong to the same row and arf reacMbte by fo|k?wing an ec|ge up and 
another edge down (degrouping-regrouping) are coniidered. 

We mentioned previously that the smgle attribute def^ouping heuristic does not turn 
out to be advantageous if applied to the resulbjnt Dartttton of the pairwise grouping 
heuristic Hence, if applied as a sequel to the paJrwise grouping heuristic the single attribute 
ungrouping heuristic will only be a* good as the single attribute 4egrouping-regrouping 
heuristic The single attribute ungrouping heuristic is thus meant as a stand alone heuristic: 
It is initially applied to the one-file parttUon; the mlo^ia^qp -proceu move* upward from 
the bottom of the lattice until the optimal partition or a near optimal partition is reached. In 
this sense, the single attribute ungrduping heuristic Is the inverse process of the combination 
of the pairwise grouping heuristic with the single atftibute degrouping-regrouping heuristic. 

Experimentation with this heurtitlc ht» generafly been satisfactory. In the smaller 
attribute partitioning problems (in terrhji of the rrtimber of attribute*), the single attribute 
ungrouping heuristic has been able to locate the Optima! partition. In the larger problems, 
the heuristic did not perform as weff as me pairwise g r o up i n g heuristic, and resulted in a less 
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optimal partition. 

The drawback of the tingle attribute ung4*upinf heuristic is ma*ntv m ifewneas in 
reaching a locally optimal partition. This is due to the i^bnajnf o*o feasor*, h At each step 
of the heuristic, all the attribute* in non^ngkttg fele^ |»e| e |tf| fcw >»»» ^|e f ronp e d , but 
at each step these attributes have to be regrouped with a» of th* feuM*tmg '*****• H U hatd 
to assess the order of search cost for tty* heuristic, but fc appears thjtt H Utfce* as many 
partition evaluations as the patewJae ffouping and the smgte attribute 
aegrowping-regrouping heuristics combined, fr fr mo* p*rt*jhi|Mi»* B fn b i e wt considered, 
the optimal partition was much closer, both hi terms of the, numbff of eefcef *» (he lattice of 
partitions and in terms of performance com to ***» <**> to »he one-file 

partition. I.e. the distance, h* terms of the, wmber of edge* a* partition was 

much smaller to the top of fee lattice than to the bottom of taw b)t#ce. T4i» points to the 
additional advantage that * stepwise mmlintiatten heurtotic that Harts from the trivial 
partition has over a heuristic that starts from the ojwr fife pafttftm. 

Z- Exporitntmttag with tkn 

The effectiveness of our approach to attr^te partitioning in a seif-organUmg 
database system can be fujly evaluated only by emptying it over an e^UendecL period of tiene 
in an operational setting. However, it is possible to obtain some partial allotments of these 
techniques by means of experimentation in a coM«^ en Virooment 

We have conducted an extensive program of «tperlmeot«Uort with thf pairwis* 
grouping heuristic, the fast pairwise grouping hwrisUc, and the single attribute 
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degrouping-regrouping heuristic. We have also performed a number of ten* on the other 
attribute partitioning heuristics. The overall result for each heuristic was reported in the 
discussion of that heuristic. In general, none of the attribute partitioning heuristics 
performed as well as the (fast) pairwise grouping heuristic, and none of the degrouping 
heuristics performed as well as the single attribute degroupmg-regrouping heuristic. 
Henceforth we will concentrate on providing a detailed description of the experimentation 
performed on our main heuristics and provide an assessment of the performance of these 
heuristics in comparison with the exhaustive enumeration procedure. The exhaustive 
enumeration procedure performs an exhaustive search of the full space of possible partitions 
and so is guaranteed to locate the true optimal partition. 

Each experiment was concerned with selecting an optimal partition for a particular 
database in the context of some given usage pattern. In each experiment, the pairwise 
grouping heuristic and the fast pairwise grouping heuristic were tried suiting with the 
trivial partition. Thereafter the single attribute degroupmg-regrouping heuristic was tried 
starting with the resultant partition of the pairwise grouping heuristics (i.e. the pairwise 
grouping heuristic and the fast pairwise grouping heuristic). If the single attribute 
degrouping-regrouping heuristic was relevant at that point, then the pairwise grouping 
heuristics were retried starting with the resultant partition of the single attribute 
degrouping-regrouping heuristic In addition, using the exhaustive enumeration procedure, 
the optimal partition was found for the same set of database parameters and the database 
usage pattern. 

In an operational setting, the usage pattern submitted to an attribute partitioning 
system would be based on historical records of database use; in our experiments, we generated 
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these patterns ourselves, hhrturaily, a fit* program of ai^Mntkatkm would require 
performing experiments for the fuH spectrum of data characterise and usage patterns; since 
this was obviously infwstbte, we had to restrict our studies la a limited aet of coats. Selecting 
these cases posed a real dilemma. We did not want to ch ees e them ot random, for we might 
thereby focus on uninteresting or u nr e presen t ativ e examptes; but af we selected them too 
carefully, we would potentially sacrifice some of the validity and generator of our testing 
procedure. In consequence, we seeded on the fohowtag approach. For each experiment, we 
provided a set of database parameters and a set of ether psramitaii «bot roughly *nA 
succinctly characterized a particular kind of usage pattern. These succinct oaage pattern 
parameters were then used to drive a wage pattern generator, areaahinitic in nature: It 
randomly generated a set of individual t riiua c ta wu m acc o rdaina n i ttu hefiven parameters, 
and then summarized them in a usage pattern rlrarisaliuii. The ottpot of «* generator was 
then used as the usage pattern for the experiment 

In «H our experiments, we have kept two of the database parameters fixed: n, the 
number of tuples in the file, was set to 100,000 and S. the page site, was set to 1024 
words. We believe these figures reflect typical values tor real (Isfahan i «n anr event, the 
behaviour of our attribute partitioning system should not differ radJcalry for problems with 
different values for n and S. Abo, we are permitted to constrain mi least one database 
parameter without reducing the degrees of freedom inherent in our probiem. 

The results of our experimentation are depicted to Tablet I through 9. Our most 
extensive series of experiments were conducted win files of Sand 22 attributes. The limiting 
factor in our experimentation was the use of die exhaustive enumeration procedure. Even 
for a file with 8 attributes, the number of possible parttttero in the search apace is 4140 and 
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using the exhaustive enumeration procedure to find the optimal partition for the problem is 
a nontrivial task, one that can only be accomplish ed with difficulty. For problems with 
m > 8. we have not been able to use the exhaustive enumeration procedure to verify the 
performance of our attribute partitioning heuristics. However, we can reach a number of 
conclusions concerning the behaviour of the heuristics for problems with large m by 
extending the result* obtained from problems with msa 

Table I shows the characteristics of our experimentation. Each row shows the 
number of experiments carried out for an attribute partitioning problem with 5, 6. 7, 8. 15, 22, 
or 30 attributes. The average number of conjunctive query types and the average number of 



Number of 
attributes 


Number of 


of cofuuncjwns 


•fflsr 


^- ^W*8* ewe-f^o 
partition cost ' 


'eVnarag* trivial 
partition cost ' 


5 


10 


8 


5 


1J0 


0.708 


6 


10 


8 


4 


1.0 


0.449 


7 


14 


21 


6 


IjO 


0;184 


8 : 


20 


21 


10 


1.0 


0.499 


13 


10 


45 


19 


1.0 


0.189 


22 


20 


70 


12 


1.0 


0.122 


30 


6 


60 


25 


1.0 


0.505 



Table 1 Characteristics of the experiments. 
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disjunctive query types for each row is depicted in columns S and 4 of the table. The last 
column of Table I shows the performance cost of the trivial partition with respect to the 
onefile (un partitioned) performance cost, that ft. «ir attrtbu problems with a 

certain m, the ratio of the cost of the trivial partition to the cost of the ur^rtfcioried Me was 
computed, and column « of Table 1 shows Ink averafe of such rettm for each m. In the 
remaining tables, ail columns superscripted by T indicate that the figures m the column is the 
average of the ratios with respect to the one-frte partition. 

Table 2 show, the rewk of e*p*»*n*»* with »•«. The optimal partition found 
by exhaustive enumeration and the partition found by the pairwise grouping heuristic and 
the fast pairwise grouping heuristic coincided In al of the ten experiments {at indicated in 
cahimn S). The average AttuT tht^ftnaf ji^ t^nd ^ ft^ lo * one^tte 
partition was 0.59. The heuristics iterated on the average for 25 steps. Thti Includes the hut 
step where the heuri«»es could not improve upon me hit partition and returned the partition 
found by the previous step. « 

Number of attributes: 5 
Number of experiments: 10 

Average cost of best Number of ships ^ Nwrtm of time* optimal 
partition found rw^ffrrafrd, ' oarfmnn was net found 



25 

26 .,-. 
Tablet Result of experiments with m-5. 
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0.1*2 


Pairwise 




grouping 


0.592 


heuristic 




Fast 




pairwise 




grouping 


0.5*2 


heuristic 
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Tables 3 and ♦ depict the results of exp ei en e m a with m - 6 and 7. Again In all the 
case* testedr the optimal (written was found by both the p an wis * g ro uping heuristic and the 

Table 5a shows the rasvt of ' the- 20 saaerimenti pertained with m - 8. As Indicated 
in column 8, the pairwue ffoamftng heurta^«nd tha fast paitwise grouping heuristic found 

Number of attributes: 6 
Number of experiments: 10 

Average cost of beet Number of steps Number of times optimal 

■frtWwfitnrf; 1 teWLtoato partition wjn vtf l9vnd 

Exhaustive " * ' ' f '" • '**»*»"« .^-y^" 

enumeration 0.434 - - 

Pwrwfoe 

grouping 0.434 2.1 . 

heuristic ■ 

F«$t, 

grouping 0.494 ^ 

Heuristic 

Table f . Result of experbMnts with *♦•$."■ 



Number of^iMributeet 7 ■■■-■■ 
Number of ex p eri ment s: 14 . 

Average cost of best Number of steps Number of times optimal 

•« V PtrtWmtaff*' . : tau^fijesjtal ^ MrttticjftWfftlgygyB^ 

Exhaustive 

^pnumav^tiohk- --:':;-'; ..; ; 0478 -...':■■■ -■■■/'? • -. ; . •-. '. 

Pairwise 

grouping 0.17f> $0 O 

heuristic 

Feat 
pair wise 

eupUi^ i:; 047* ■-,■■ .'.•-.,>**,?■■ r.'. •>:;.' . 0'" 



EL 1 

hei 



urittic 



Table 4 Result of experiments with m « 7. 
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Number of attributes: 8 
Number of experiments: 20 



Average cost of beet Number of slops 
ptrfitipft found, » 



Exhaustive 
enumeration 


0.46S7 


■ :■»*.-' 


Pairwise 
grouping 
heuristic 


04«o7 


**V"; 


Fast 
pairwise 
grouping 
heuristic 


0.4657 


48 



Number ef times optimal 



Table $a - ' Rewk of eKpcrtAMMs with m • S. 



Number of attributes: 8 
Number of experiments: I 



Average cost of bet! 

EirJiiigsiafflii 



Number of tteee 

htvrftfr 'forth* 






Single 

attribute 

degrouping- 0.9999 ..,........£© 

regrouping 
heuristic 

Table 5b Result of the Jingle attribute 



© 



E«Hl"e>««L. 



the optimal partition in a* but <>ne <eae. Tab* lie shows the i«jb^ of applying the tingle 
attribute degrouping-regrouping heuristic to the mutant partition of the paJ i wh e g HJuptng 
heuristics in the one proWem where the pairwise grouping heutHOa *td no* find the optimal 
partition. The single attribute degrouping-regro u ping heurtstk found the njtttmal partition 
in 2 steps (where the second step did not find an improyew o ut te tfte+resu^^ 
The ratio of the resultant partition of the single attribute deg roup l ng lagiislip tnt heuristic to 
that of the pairwise grouping heuristics was 0.9999. lev the iocat minimum found by the the 
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pa irwise grouping heuristics was within 0.0001 of the optimal solution. (Columns 
superscripted with Y is the performance cost with respect to Hie performance cost of the 
partition found by the pajrwise grouping heuctsUcsJn the pre*eu#tabhr.) 

Table 6 shows the results for problems with m - 15. In this series of experiments 
and in all the experiment* with larger m, the exhaustive enumeration procedure had become 
infeasible (for example, W15) fit 1.38 « 10 9 ). However in all the experimenu with m - 15, 
the pairwise grouping heuristic and the fast pairwise grouping heuristic coincided in their 
resultant partition. The average cost of the best partition found by the heuristic was 0.15 of 
the cost of the one-fite partition for this set of experiments. Also, none of the degrouping 
heuristics were relevant at this stage. 

Tables 7a and 7b show the result of 30 experimenu with m • 22. In half of the 

experiments, it was possible to improve upon me result of me pairwise grouping heuristic 

(which always coincided with the result of the fast pairwise grouping heuristic). The.JingJe 

attribute degrouping-regrouping heuristic was applied in these K) cases, and the partition 

Number of attribute*: MS 
Number of. experiments: 10 

Average cost of best Number of steps Number of times a better 

paction, found 1 frurWft tttffted paction, w»« found 

Exhaustive 

enumeration - ,. - :> 

Pairwise 

grouping 0.1601 83 } 

heuristic 

Fast 

pairwise 

grouping 0.1501 &3 ^O 

heuristic 

Table 6 Result of experiments with » - IB. 
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Number of attributes: 22 
Number of experiments: 20 



Average cost of best Number of steps M umper of time* * better 

ptrtiti«BJoye# r 



Exhaustive 




enumeration 


»• 


Pairwise 




grouping 


o.ioio 


heuristic 




Fast 




pairwise 




grouping 


6,1010 


heuristic 





i©.» 10 

*$* it 



Number of attributes: 22 
Number of experiments: 10 



Average cost of best ****** ef steps humesi of thee* * better 

partition found* 



Single 

attribute 

degroupihg- 0,11973 fj 

regrouping 

heU*iS#C 

Table 7b Result of the single attribute d«gr«»0ing^r«graipihg fttWUtR W*h m * 12. 



reached by the single attribute deg r»tt p m| ii|Tii f tpg heurtrtte «si ehuet such that no 
other (pairwise grouping or degrouptng) rtettffctk cowM improve upon ft. ft** single 
attribute degrouping-regrouptng heuristic feerated tor an »mgff of $* steps. The 
performance cost of th* partition found using the single aftrfhajce J s ^iwyiw^ -r eg ro up t ng 
heuristic was on the average 0.0029 less than that found by the petrntoi gi upplna, heuristics. 

Tables 8a, 8b, and 8t show the result of experiments wtth m * 30. In most of the 
experiments, the single attribute degrouptng- i e giwipHn ***** •■» •** *» tthjnw* i« 






Chapter 5 



123 - The Attribute Partitioning Heuristics 



Number of attributes: 90 
Number of experiments: % 



Average cost of best : , ffpflK.fi 



Exhaustive 
enumeration 

Pairwise 
grouping 
heuristic 

Fast 

pairwise 
grouping 
heuristic 



0.30570 



0.30575 




Number of tima* a better 



21.2 



21.0 



Table 8a Result of exp er ime nts with m - 30, 



Number of attributes: 30 
Number of experiments: 5 



Single 
attribute , 
degrouping- 
regrouping 
heuristic 



Average cost of best Number of steps 



0.9929 



Number of times a better 

llMfftfjIMPf 



2.3 



Table 8b Result of the single attribute degtouptng-regrouping heuristic with m - 30. 



Number of attributes; 30 
Number of experiments: 2 



Average xost otheet : .; |iu s ^r^ elipe, 

2.0 



2.0 





partition found* 


Pairwise 
grouping 
heurfstk 


0.9995 


Fast 




pairwise 
grouping 
heuristic 


0.9995 



Number of times a better 
partition was found 



Table 8c Result of reapplying Hie pairwise grouping heuristic with m - 30. 
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average of 0.0071 on the partition produced by the pa^wise groups** heurtfifea. In 2 out of 5 
cases where the single attribute degrouping-regjwping heur jlied. the pairwise 

grouping heuristics when reapplied starting with the final partWon of the sJngte attribute 
degrouping-regrouping heuristic, produced a partition with an awage further hnprorement 
of 0.0005 in performance cost No other heuristic could Improve on the partition found by 
the reapplication of the pairwise grouping heuristics. The Qftlumn superscripted by T in 
Table 8c indicates that the figures in the column is the percentage improvement with respect 
to the partition found by the single attribute de g roup l ng t egi oupi n g heuristk in Table Sb. 

Table 9 depicts the execution time statistics. Far all of the heuristics considered, tbe 
number of partitions evaluated increased as the number of attributes of the increased. 
The exhaustive enumeration procedure had tJWiibstest rale *f ttoieajB «frHe the pairwise 
grouping heuristic had a drastically -tower rate. The fast pefrwise grouping heuristic had an 
even lower rate of increase than the pairwhe grouping heurisUc mihe number of partitions 
required for evaluation. (A*,** .$howjd, in Jbe pjtv4ou**JottioaB^ 
evaluations for the exhaustive enumeration procedure It appl oicimately on the order of m*, 
for the pairwise grouping heuristic en the order of m 3 , and for fH| ftJrwise grouping 

heuristic on the order of m*) With w * 30, the fast patfWisr gtauphi^ heuristic required 
roughly 1/5 the number of partition evaluations required by the exhaustive enumeration 
procedure with m - 8. This is a significant improvement Note that th^ processing time 
required by a heuristic depends both -on the number of pactions <vahHU«d?and abo on the 
number of query types in the usage pattern. For example, the processing time for the last 
pairwise grouping heuristic it** the order of the pi um>U of rhe^nuwber of query types in 
ihe usage pattern and the square of the number of attributes in the fHe. 
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Exhaustive enumeration: 



5 

6 

7 

8 

15 

22 

30 
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52, 

203 

877 

4140 

«438# 5 10* 

*4S0*1O ,, 

at 8.46 • 10 23 



Average processing 
IEBkusESsSISeL 

6.8 

28.5 



2212 



Pair wise grouping heuristic: 





Average number of 


: . .Average processing 


Number of attributes 


partyiom sveluttaJ 


iime in seconds 
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Table 9 Execution time statistics. 
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The result of these experiments may he summarised as fottows: For problems where 
m £ 8, both the pairwise grouping heuristic and the last pairwise g i uuptef biurisUc found 
the optimal partition in all but one case. The optimal partition found was significantly 
superior to the one-file partition in all of the cases tested. In the out case where the pairwise 
grouping heuristics did not find the optimal partition, the partition that they found was 
within 0.0001 of the optimal partition. In this case, the single attribute 
degrouping-regrouping heuristic when starting with the final partition of the pairwise 
grouping heuristics found the optimal partition by the transfer of orri> a stngle attribute from 
one subfile to another. Altogether, for m % 8 where the optimal partition could be determined 
by exhaustive enumeration, the combination of a pairwise grouping heuristic and the single 
attribute degrouping-regrouping heuristic was always able to find the optimal partition. For 
problems where m > 8, it was not possible to directly verify the outcome of the pairwise 
grouping heuristics by exhaustive enumeration. However In most of the problems tried, the 
pairwise grouping heuristic found a partition that no «fher degrouping heuristic could 
improve upon. In addition, the partition found by the paiiwise grouping heuristics was 
always significantly superior to the one-file partition. In tew than half the experiments with 
m > 8. the single attribute degrouping-regrouping heuristic improved on the resultant 
partition of the pairwise grouping heuristic; and hi these cases, the improved' partition 
differed insignificantly from the resultant partition of the pairwise grouping heuristics, 
attesting to the desirability and the near-optimaHty of the pairwise grouping heuristics, in 
the very few cases where the pairwtse grouping heuristics were relevant. to the partition 
found by the single attribute degrouping-regrouping heuristic the corresponding 
improvement in the performance cost was negligible. AH this indicates that the combination 
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of a pairwise grouping heuristic and the singta attribute degrouping-regrouping heuristic 
(when applied alternately) converge rapidly to a desirable soltition. IC we can extend the 
results for cases where m s 8 to the cam where m > 8. we may conclude that the optimal 
partition does not differ at aH or does not differ significant!? in performance cost from the 
retuttaitt^rtiaotf^ and tn e single 

attribute d eg r o u p in g #e gio uping heuristic " 

A number of observatlom were also made during the experimentation concerning 
the behaviour of the attribute partttKwmg syrtem. Speclflcalh;, we comment on the following: 

I- A number of experiments were performed with database parameters and usage patterns 
that were variations of the parameters of an earHer experiment That is, we took the 
database parameters and the usage pattern of a problem and we changed the database 
parameters, such as the set of available indices and the attribute lengths and selecttvities, 
or we changed the frequency and/or structure of the query types. (In aH these variations 
the number of attributes, the number of tuples, and the page sire were kept constant.) 
We then applied the attribute, partitioning heuristics (and the exhaustive enumeration 
procedure, when possible) to these variations to observe how r the optimal partition 
differed form one problem to another. As it turm] out, changing the database parameters 
such as the attribute lengths and setactrvittej, soinaames even radically, did not 
significantly alter the optimal partition: at most one or two attributes moved from one 
subfile to another. However, changing the usage pattern, Such as the frequency of query 
types or the composition of the co m pone nt* of the query types* sometimes drastically 
altered the optimal partition. Hence it appears that In our environment, the optimal 
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partition is more sensitive to the usage pattern and test sensitive to the database 
parameters, and thus that the usage pattern is mew a determinant of the optimal 
partition than the other parameters. 

2- It was observed that the larger the number .of ava^ablt mdke* ior an attrtbtite 
partitioning problem, the larger the subfiles of the ojrttoai awtiUVm4*,mor« attributes 
were clustered together in tM same subfile), and the sm*«er4he number of subfiles in the 
optimal partition. In addition, the perfortnance cost of the ootimei partitten (and afoo 
that of the trivial partition) became closer to the per form a nce cost of the one-file partition 
as the number of available indices were increased. We provide the following as an 



explanation for this observation: If an attribute is Indexed, the index will most likely be 
used to resolve that attribute when the attribute U requested bf the selection component 
of a query. If the attribute Is not indexed, then the attribute hat to be resolved by either 
sequentially searching or by linking, and any other attribute that exists with this attribute 
in its subfile win also have to be retrieved in its entirety. ThJt wHt incur extra page 
accesses. Even if the other coexisting attribute was requested In the selection component 
of the query, it is always easier to Mnk to the. co-existing attribute (from a reduced Tf 
list) in order to resolve it rather than sequentially search the coexisting attribute in its 
entirety or link to it from a larger TID list Therefore, having an index available on an 
attribute will in most cases eliminate the need for its sequential searching or linking to 
the subfile containing the attribute, and hence, increase the overaH attractivtty of the 
attribute to the other attributes in the file. Therefore, indices contribute to the clustering 
of attributes In the file. This point will again be taken up in Section 6.2 when we discuss 
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the relative efficacy of the trivial partition. 

3- in Table 8a we -see that the average cost of the partition found by th* pairwise grouping 
heuristic is not exactly the same ai that for the fast pairwise grouping heuristic It turns 
out that for one of the experiments with m - 30, the outcome of the two pair wise 
grouping heuristics did not coincide. Although It Is not impossible that the resuHs of the 
two heuristics be different, in Section 55 we argued that this is highly improbable. 
However, It appears that this becomes more probable a* m Increases because the 
heuristics have hi iterate for more steps and the pairwise attractivity measures computed 
by the fast pairwise grouping heuristic In the hrst step becomes less likely to hold as the 
heuristic iterates for a large number of steps. Thii was th« case In the one problem 
mentioned above I.e., a pair of attributes, neither which participated in a grouping, 
turned out to be unattractive in the latter steps of the pairwise grouping heuristic. 
Unfortunately k is precisely for these problems with targe m that the fast pairwise 
grouping heuristic is most needed. 

We have developed a third pairwise grouping heuristic to remedy this discrepancy. 
This heuristic (we may caN it the general pairwise grouping heuristic) is a combination 
of both the pairwise grouping heuristic and Hw ^ pairwise grouping heuristic. The 
general pairwise grouping heartstk: stttm with the trrvial partMon, ami attempts pairwise 
grouping In the same way as the fast pairwise grouping heuristic does, but only for a 
limited number of steps, say k, steps. (The fist pafrwise grouping heuristic iterated for 
an unbounded number of steps until it couH not produce in improvement to the last 
partition found.) The partition that results after K, steps is then taken and the fast 
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pairwise grouping heuristic is then reapplied to this partition but only for K 2 steps. 
This process is repeated (each time for a limited number ..of steps)- until at some point the 
fast pairwise grouping heuristic produces a partition .that it cannot improve upon. 
Everytime the fast pairwise grouping heuristic is, reapplied, aM pairwise attractivity 
measures of blocks are recomputed; even pairs of blocks tfeat did not participate in a 
grouping since the last application of the fast pairwise grouping heuristic will have their 
attractivity measures recomputed. In this manner, if^ i^~aM chosen such that they 
are comparatively smaller than the number of steps the pauwise grouping heuristic is 
expected to iterate if applied to the w same problem, then disoepeneie* Uke the one 
encountered in the above problem will most uaely b« eliminated, k^kg, _ are most 
effectively chosen if k, * k 2 * ..-. i 1, because as the process of pairwise grouping 
approaches the optimal partition, pairwise block attractivity measures become smaller, 
and there is a greater chance that a pair of blocks may be attractive in one step but 
unattractive in another step. For example m the piokttm wisb m • 90^ we chose k, - 6, 
k 2 - 4, k 3 -2, and K 4 - k 5 - ..... . 1. The reason that the general pairwise grouping 

heuristic is a combination of the two other pairwise grouping heuristks is that if 
K, m k 2 - ..... - i, then the general pairwise grouping heuristic will be the lame as the 

pairwise grouping, heuristic, while if k, is taken to be unbounded, then the general 
paii wise grouping heuristic will become the fast pairwise grouping heurtttk. 

We have tried the general pairwise grouping heuristic on a« of the si* probJems 
with m - 30. It always found the same partition ai the partition that the pairwise 
grouping heuristic found, and on the average the: geneial patr^im grouping heuristic 
required the evaluation of 1168 partitions and took 1596 seconds of processing time to 
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accomplish that. This compares favorably with the palrwise grouping heuristic where 
the respective figures are 9964 partition* and SM4 seconds. The processing time depends 
on how the paranietefs *!.**»*» _ are selected. The larger that rhey are chosen to be. 
. the faster the heuristic will be, but the smeller the probability that the partition round 
will coincide with the partition found by the pairwise grouping heuristic. 

In the next chapter we wilt describe in detail one of the experiments that we have 
performed. We will provide the database parameters, the usage pattern parameters, and we 
will show the sequence of partitions found by the pairwise grouping heuristics. The problem 
we have chosen to elaborate in Chapter 8 is the one ptubtroi with m - 8 for which the 
pa irwise grouping heuristic did not And the optimal partition. 
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CHAPTER 6 
AN EXAMPLE OF ATTRIBUTE PARTITIONING 



We have presented the combination of the fast pairwise grouping heuristic and the 
single attribute degrouping-regrouping heuristic as the main, heuristic technique of our 
attribute partitioning system. We have also presented a number of other attribute 
partitioning heuristics. In Section 7 of Chapter 5 we reported on the results of a program of 
experimentation with the pairwise grouping heuristic, the fast pairwise grouping heuristic, 
and the single attribute degrouping-regrouping heuristic We applied the* three heuristics 
to a number of usage pattern histories in the context of different database environments. We 
also performed a number of experiments on the other attribute partitioning heuristics, and we 
reported the overall results of the experiments in the previous chapter. In summary, none of 
the other' heuristics performed as well as the fast pairwise grouping heuristic; the fast 
pairwise grouping heuristic consistently produced the same partition as the pairwise grouping 
heuristic, while requiring only a fraction of its time In all the cases that we tested, the fast 
pairwise grouping heuristic consistently produced either the optimal partition (as determined 
by exhaustive enumeration) or else a near optimal partition that differed from the optimal 
partition by less than one percent. In those cases where the resultant partition of the fast 
pairwise grouping heuristic was nonoptimal, by using the single attribute 
drgrouping-regrouplng heuristic to improve upon the resultant partition, we were able to 
obtain thr optimal partition in at most three steps. 
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1. Th» Statement and Solution of tho Problem 

In order that the reader develop a feel for the form and magnitude of the 
partitioning problems that we have considered and the heuristics used to search for their 
solutions, we present an example of an attribute partitioning problem In this section along 
with the solutions obtained by applying a number of the attribute partitioning heuristics. We 
have included only 8 attributes in the relation of this example (BX8) - 4140), so that we may 
arrive at the optimal partition by exhaustive enumeration of all possibilities. The attribute 
partitioning problem considered here is typical of the p r o ble ms we have solved in our work. 
It is also an example of a problem where the single attribute degrouping-regrouping heuristic 
was relevant. 

Figure I shows the database parameters: ro, n, A, S, Jm and sj . Figure 2 shows the 
usage pattern query types as described hi the query type table. The query type table contains 
the frequency of each query type, the connectivity of the predicates (conjunction or 
disjunction) in the selection component, the attributes in the selection component, and the 
attributes in the projection component. The Joint selectivity of the selection component 
(although not actually included in the query type table) is also depicted. 

The results of applying the heuristics and the exhaustive enumeration procedure on 
this particular example are shown in Figures S-6. The total cost of processing the set of 
queries, when the file is unpardoned (i.e., the one-file partition), is 1881286 pages (Figure 3). 
This processing cost was calculated by the file cost estimator in the manner described in 
Section 4.4. Figure 4 shows the partition that results from applying the pairwise grouping 
heuristic; the fast pairwise grouping heuristic produced an identical result. Both heuristics 
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m-8 The number of attributes m the relation. 

n - 100,000 The number of toptes m the region. 

A - { 1, 2, 3, 4, 5, 6, 7, 8} The attribute* of the refatton. 

{1,3} The Indexed attribute* 

S - 1024 words The system page rite. 

7693 pages * The number of pages J» the relation 

Attribute lengths and setectlvities: 

tmfrtmi A&gliSiSSf&Jk Attytbyte^eh^vJOLl,, 
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Figure 1 Database parameters. 
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Figure 2 Query type table. 
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iterated for 4 steps. 

All the partitioning heuristics haw been progr amm e d in the programming language 
MDL [261 The compiled version* of th»pairwi«fCTnp^^«ri»^and of the fast pairwlse 
grouping heuristic nwrtted riMhe peftttien P 3 in 22 seconds and B second* respectively, »nd 
required the file OT$te»timatt«fl of 74 ami «part*tien*f«apectJverf. 

As it turns out, P 3 , although near optimal, i* not optimal and differs from the 
optimal partition by a neghgibte amount The nrhauifr ve enumcmtfcw procedure (also 
programmed m MDL). found the optimal partition P* of Figure 6 after trying all 4140 
possible partitions. The exhaustive enumeration pro^iire required 2305 seconds of CPU 
time to generate all possible partniem and to cost evaluate them to order to arrive at the 
optimal partition; thia is 23 order* of magnitude flffPfr than the, last pairwise grouping 
heuristic. 

Comparing P 3 of Figure 4 and P 4 of Figure 5, we see that in the near optimal 
partition P 3 , attribute 3 is grouped with attribute 7, while m the optimal partition P*. 
attribute 3 is grouped with attribute 8. Hence when me single attribute 
degrou|iing-regrouping heuristic was tried on P 3 figure 5), it produced the optimal 
partition P 4 in two steps and albjr the evaluation of 24 partitions. (The second step being 
the teaj to see if the heuristic could Improve upon P*} As may be observed from partitions 
P 3 anti P A , none of the other degrouping heuristics applied to P 3 could have resulted in 

C({{1, 2, 3, 4, 5, 6, 7, ft}}) -1881286 pages 
Figure I One-file partition permrm a ncf cost 
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F *g»« 4 The results of the pairwi* grouping; heuristic 
: arkl the fast pMrwise gro u pi ng ' 
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Figure S The result of the single attribute degrouptng-regrouping heuristic. 
P* is the optimal partition found by extenuUve enumeration. 



pV 

The single attribute degrouping heuristic, the double attribute grouping heuristic, and the 
double attribute degrouping-regrouping heuristk aR produced partitions with higher 
performance costs than P 3 . 

We also tried other grouping heuristics starting from P° . The triplewise grouping 
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heuristic resulted in a partition far less optimal than P 8 . The two most attractive pairwise 
grouping heuristic found the optimal partition P\ (lahoMgh it has not always been the case 
that this heuristic would find the optimal partition). Abbougb the k-partU4on pairwise 
grouping heuristic with * - 2 was abb *o ftad tlie apiiniai pantdon P* , it found the 
optimal partition after com evaluating 120 partitions, aaMe^Uy more than the 67 partition 
evaluations required by the comWnaUon of the fM peb^^ grouping heuristic and the 
single attribute degrouping-regrouping heuristic For files with a larger nuinber of attributes. 
the disparity in search cost (between her increases. 

The single attribute ungrouping heuristk was abb) *e produce the optimal partition 
P* when started from the one-file partition. The con»e i giwg n aue nce fa shown in Figure 6. 
This is in accord with our good experience wtth the stogfc atttlbute jona^ouping heuristic 
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Figure 6 The result of the single attribute ungroupiftg heuristic. 
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The single attribute ungrouping heuristic required a total of 32 seconds of CPU time and 
terminated after considering 107 partitions. 

2. The Efficacy of thf TrtYlftl Pftf*vll*i9n 

Ir» the exaniple preseitted above, we see that tlte tibial partition P° in Figure 4 has 
a surprisingly to* pecformawce cert when uu s apai l U with the one-file parttUon of Figure 3. 
In fack the trivial partition in most enatwpks mrm eut to be re b niwl y near optimal, end for 
some combination of database usage pattern and datsbaat paraQiitevi, the performance cost 
of the trivial partition can be tes* than 90* percent of the performance cost of the one-file 
partition, in the course of our CT e er lroeniattqo. it was also observed that the relative 
opttmatity of the trivial partition inc rea sad as the nume m of the attributes in the relation 
increased. We offer the foNowtng exptanattora for the rebttive opttmaHty of the trivial 
partition: 

(a) Most Queries access only a few of the attributes. For every query that accesses a group 
■ of attributes, there are usually other queries that access a subset of this group of 
attributes (plus possibly other attributes) and hence cause the group to break apart (i.e., 
the subset of attributes becomes less attractive to the rest of the group). For example, if 
one query accesses attributes 1 ami 2 and another query accesses attributes 2 and 3, 
and both have the same free^iency, then <d sp snc hng off the selectivities of the queries) it 
is most Hkely that a trivial partition of the attrfbutes l r 2, and 3 is more cost effective 
than a one-file or a*wo-subftfc partition of them. 
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(b) In order that two attributes be grouped together in the same Mttfel, they have to be 
accessed in the same query. The more selective the query &e. the smaller the number of 
tuples that satisfy the query's predicate), the greater. jk9.wfrMimff. t , of grouping the 
attributes. If the query U very selective, grouping the attributes can reduce the number 
of page . accesses to half rt»-;paf#-' 'Mmmh¥m-mim&^mm**M»A if -the- attributes 
were not grouped; therefore the Utri b ua ai used jn a J e hmlixj quegy are very attractive 
to one another. Conversely, if • query » rather umehxtive, the relative advantage 
gamed by grouping its attribu*** is slight. For i v seaj h, If a query i* so wiHteeU ve that 
it requires all the pages of the aabftk, t li tn grou pi ng a* the aetrtbutes in owe sobflte wiil 
incur no more page acoeaje* than separating thorn fcrto severe* subfihs (which indude 

no other a«i^u<a»Kt Th^ 

for selective ooertes, the unseketive queries redue* the overaaY attract! vlty among 
attributes which hat been induced by the selective queries. On the other hand, selective 
queries incur fewer page accesses than unsetecttve queries, and the performance cost of a 
partition is determined to a greater extent by unsetecttve queries. Thus for a usage 
pattern where selective and unselective queries are represented with the same frequency. 

the optimal partition will be close to .the trivial partition hi order to reflect the 

■>■'■■; .'■'■'iv.. ''"■•■ • 

dominance of the unselective queries. 

(c) An attribute that is resolved by sequential searching or by hirking from a TID list 
which contains a large number of TID* (when ccenaured m the total riurnber of tuples 
in the relation) wiH manifest a smaHer attra«rv4ty to the other attributes of the fite. In 
contrast, an attribute that is indexed wiH have a relatively higher attractlvlty measure to 
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the other attributes of the file. This is because when an attribute is resolved by 
sequential searching or by linking from a large TID list, all the values for that attribute 
will have to be retrieved in their entirety. Any attribute that may coexist with this 
attribute In the same subfile will also have to be retrieved in its entirety, even if this 
attribute is not requested in the query that requests the unindexed attribute. Hence (as 
we have noted in Section 6.7), providing indices on attributes will contribute to the 
clustering of the attributes. In the experiments we have conducted (and also in most 
real databases), only a few of the file's attributes were Indexed and most attributes did 
not have an index. Therefore for such cases, the trivial partition will be relatively 
optimal, since the attributes that are not indexed wiH be unattractive or show little 
attractivity to the rest of the attributes of the file, and the optimal partition will be 
composed of a large number of subfiles, each subfile containing only a few attributes. 

The merit of the trivial partition has not been overlooked in practical database 
design. A variant of the trivial partition in a non-flat file implementation of a database is a 
file organization where the main file consists of records of pointers, in which each pointer 
points to the actual attribute value stored in a secondary data area [251 The secondary data 
area is separate from the main file, and values of different attributes are stored separately. In 
this manner, the attributes are separated from one another into their own exclusive areas in a 
way that resembles a trivial partition. Accessing one attribute will therefore not bring in any 
values of other attributes (except for the pointers which are relatively short), but will cause 
one extra page access since the pointer has to be followed to the actual data. In such file 
organizations, data compaction techniques (like eliminating redundant copies of the same 
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attribute value) becomes possible, and are often employed to reduce storage (and also access) 
requirements (but at the expense of computing time). 
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CHAPTER 7 
REMARKS AND FUTOM DIBBOTION8 



In this report we have outlined a tettodtpttve relational database management 
system that perform* attribute fMrttttortinf tor siNgto relatton databases We have afso 
developed* number of compatattonaBy hmilbk ettrtbote par ttUun ing heuristics that select a 
near optimal partition for the attrtbotevwWv »w geer^ paging 

performance of the database. In the next section, we provide suggestions for extending the 
underlying environment to order to sohw more realistic attrlbote partitioning problems, 
without Introducing the need for revamping the attribute partMon heoNstlcs. Thereafter, we 
conclude this report with suggestions tor extending the seepe of attribute partitioning to 
wider problems, along with a discussion of tne relationship between database attribute 
partitioning and other physical database design issues. 

~ Exten »* on » ■*• **»« Attribute* Po-rtitioninfr System 

There are numerous issues and parameters that have to be considered when 
optimizing the physical design of a database in a complex environment In this report we 
have addressed the attribute parttttoning problem bribe context of* model of the database 
management system that Incorporate* a number of those parameters that we feel are more 
important and whose consideration is required in order to have a meaningful model of 
practical database usage. Two of these parameters are attribute selectivity and the blocking 



Chapter 7 - 144 - Remar*« end Futuve Directions 



of tuples into discrete pages. However, it U poHibte to roco r pof at e into our model of the 
database management 'Wfmm'fj tmmm^**ffll^f* t1iMffi'&'WB l i&t increasing the 
complexity of the attribute partitioning problem, and without infroducing the need for 
revamping the heuristics we hate shown to be appro p riate m cw torrent model. In the 
extensions .to our cttffein -ila*^ grouping 

heuristic will stHI be the most viable neurnen: and wni ream its new ■■ opttmattty as 
demonstrated in the coum of our worfc. Ottf^ehtf * bat t d ae am the mettft of our progTa<n 
of experimentation with the heuristic, showing thai I h e pai<lti on pradtwed by the patrwlse 
grouping heuristk is consistently withm « of U* optimal eautu*t»4lor probverw where the 
optimal partition could n# bf idiamtleaV^dlla^ for thttt 

problem). However, tt n possWk that m an ennnOoavnedeVt^^^ 
pair, of blocks wt»^,be;<hi;aa|n»rl^ 

next step (i.e. violating condition 554), thus pred n dtng the ap jd h n i l l i n of she fast palrwhe 
grouping heuristic. ■.:.,-.,*-■» » ♦ <-. 

One parameter that would be desirable to include in our model is the overhead cost 
of accessing a subfile When a e^y U tnade to me database, there Is usually a degree of 
overhead associated with accessing each subfile. T|f overhead may be incurred tn opening 
and closing the subfile. iild^t|Wtheif . .^ii^^|m% naa^ tafijei^^ic-, t#«Hl4iNf|Hh^#v> ^adN^i^ptgr- ^k«r%er 
areas in primary memory for the subfile, .Qverhee^,^bafa^^an,lNr incorporated in our 
model by having .the fUe coat estimator add a ftoed onrhoaa' coot to the total coat figure 
whenever a subfile, is accessed by the method, of a 4|uery. 

One other way we may extend our current model is to differentiate between the cost 
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of accessing different subfiles and to differentiate between the storage cost for different 
subfiles. It is possible to store subfiles on a variety of storage devices with differing storage 
and access costs, in order to further reduce the database p er forma nce cost. For example, a 
subfile that is not frequently accessed and that requires a significant amount of storage may 
be stored on a secondary storage device, other than a disk, that has higher access cost but 
lower storage cost. Differentiation between the costs of subfiles has been the main concern of 
some previous studies that have viewed the attrtbutt partitioning problem in terms of 
allocating attributes to primary and secondary subfiles (Benner |4i Hoffer [18], Eisner and 
Severance [14], and March and Severance fJSl) We may Incorporate differentiated subfiles in 
our model by making the file oast estimator muttph; the number of page accesses to a subfile 
by a factor that reflects the relative cost of accessing the pages of five subfile with respect to 
the other subfiles. Differentiated storage costs for subfiles may also be included m our model 
by adding up the storage costs for each subfile depending on the kind of device the subfile is 
stored on during the time interval b etwee n reparttttonhig points. 

It would be desirable to have the system automatically and optimally determine the 
^partitioning points. Computing the repartitioning points should be based upon the 
consideration that too frequent invocation of the partitioning heuristics will result in 
expending a significant amount of computational resources on the search for a marginally 
better attribute partition, white infrequent invocation of the heuristics may result in degraded 
database performance in the intervening time between reparttttonmg points. Shneiderman 
[33] and Yao et al. [38] have investigated the problem of optimally determining 
reorganization points for the purpose of eliminating overflows in files that are due to tuple 
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insertion and deletion. They also present a number of tecnni<|uet for, th<i purpose. We feel 
that their techniques could be adapted to the prabtom of dete r mining optimal repartittoniftg 
points. 

2. D^reottopf f oy #«|forf BtffM Ufa, 

Even though our investigation of attxibuie partitioning b in many respects -inWe 
comprehensive than previous studies, we have onh/ taenia**! <d the attribute partitioning 
problem within the context of a stogte re k ttion dlNua*, in vi rowrieM, and *>here the relatton 
is accessed through a resected trrteriace <rtth la^ we d capa h^ - To 

fully realtw the flexibility of a retationai database, it is tmaesefy to eonsaier a imrki-relation 
environment together with a high-level nu i qil C MdHral rmjilagti i lHirllnl t hat p ei vn to qoer i ei 
with arbitrary join operation* between rekrtiom in theh qnaitfli alien parti. Also,- it is 
necessary to consider queries that Nave arbfrrafy boolean c iimpj u ia taw e #te. any comb ina tio n of 
conjunctions and disjunctions) in their qualification part, and in which Hie predicates of the 
boolean expression contain other comparison operators besides equality conditions. (Arbitrary 
boolean expressions in queries have been disallowed in our model mainly because of the 
complications that arise in the evaluation of such queries for a partitioned database. Abo, 
the consideration of comparison operators other than equality conditions makes the problem 
of estimating the number of tuples that satisfy a query considerably more complex, though 
not infeasible.) To our knowledge, the attribute partitioning problem for ft mufti-relation 
database with generalized queries has not yet been addressed. 

We have assumed a flat hie physical Impl eme n tation of a relation in our attribute 
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partitioning system. In most practical databases, nonflat file implementation of data is 
assumed for two reasons. I- For the purpose of data compaction, in order to reduce the 
amount of storage required to store the data. Data compaction may be accomplished by 
replacing attribute values with polnteri that point to a common but smaller pool of the actual 
attribute values (data encoding), by using pointers that point to the end of a variable length 
attribute value (so that the maximum storage space does not have to be allocated for each of 
the variable length attributes), or by "factoring out" the tame attribute value from any 
number of tuples that contain the attribute value, in order to eliminate its repetition in more 
than one tuple (as in hierarchical organizations). 2- For the purpose of access enhancement. 
Access enhancement in a nonflat imp l eme ntatio n may be accomplished by the elimination of 
join operations (i.e. searches) on subfiles by having tuples that satisfy a frequent Join 
operation explicitly linked to one another by pointers. We believe that the development of 
attribute partitioning heuristics for nonflat file implementations will net be an easy task; this 
difficulty is compounded by the complexity of query evaluation in such file organizations and 
by the problems of deriving an accurate cost model of such a database management 
environment. Babad [23 and Benner [4] have addressed a limited form of the attribute 
partitioning problem in a nonflat file Implementation of a relation. Specifically, they consider 
the problem of attribute partitioning for file organizations that allow attribute values with 
variable lengths, and where not all subfuptes are of equal length. Schkotnkk [31] considers a 
hierarchical file organization where each node in the tree is an attribute. Page access is 
minimized by partitioning the nodes of the hierarchic file according to the frequency of 
attribute requests by queries and according to the position of the attribute within the 
hierarchy. 
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We have confined our attention to Ufo b l e nu of attribute p art Kk a d ng. Attribute 
clustering, as the logical extension of attribute partitioittiHj, can tesult to considerable access 
(and storage) cost reduction over simple partitioning. As on example where attribute 
clustering may result in access cost reduction, consider attribute! *f, •* end % where a, is 
highly attractive to both « 2 and c?, but where eg and *» art highly unattractive. In 
such a situation it is desirable to redundantly store •, m t«m subfile* in one subfile with « 2 
and in the other subfile with » 3 . F u r the r more , in some cases, attribute clustering can 
paradoxically result ht the reduction of storage cost* if * set of attributes of a relation is 
reproduced in a subfile of the attribute cluster, such that aff die other attributes of the subfile 
are functionally dependent on these reproduced attributes, then tt is possible to replace 
linking as an access path to subtopics of the subfile by an eau4*§nto on the reproduced set of 
attributes. In other word*, for « given subtests, h* order so kwatt toco-subtupte in a subfile, 
the equi-join of subttiptes fn the subfile is taken with the green wbtupJe on the set of 
reproduced attributes. Since the set of reproduced attributes constnutes a key for the subfile, 
the result of the equi-join is a single subtuple which is the co-subtuple of the green subnjple. 
The consequence of eliminating the links U that the s ubf il e t un Ulnlng the reproduced copy 
of the attributes may be c o m pac te d by eliminating alt identical sub t upl e s. In certain cases this 
will result in considerable recovery of spare storage avail the iwtetupl i i of the subfile become 
unique. But there are more direct techniques fee saving storage (such as data encoding) that 
are better for this purpose than attribute ctastetlng. Therefem attribute clustering should be 
primarily considered for tt»e puipose of a«^ co* ramimization. Osman 129) considers a 
special case of attribute clustering in a single relation -database environment where the 
primary key of the relation is reproduced in each of the subfiles. The original relation may 
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then be reconstructed from the equi-join of the collection of the subfiles on the primary key. 
This model may be construed as a variant of attribute partitioning where the linking access 
path is replaced by the equi-join operation, since by reproducing the primary key all 
subtuples become unique, each subnie has n subtuples, and a subtuple has a unique 
co-subtuple in each of the other subfiles. Hoffer 08] also addresses the problem of attribute 
clustering in a flat file implementation of a relation and constructs an integer programming 
formulation of the problem, together with a branch and bound algorithm for its solution. 

A problem related to attribute partitioning and attribute clustering is their dual: 
tuple partitioning (or horizontal partitioning, where rWfffcVts partitioned horizontally by 
tuples.) It is weH known that tuples ire not requested by queries with uniform frequency; in 
actual applications it has been frequently observed that the tuple request distribution follows 
the "80-2Q« rule of thun* B71 This rule states tha* 801 oifithe queries deal with the most 
active 20* of a file. Therefore considerable acteu enh«Kement may be accomplished by 
clustering the frequently accessed tuples together, separate from the Infrequently accessed 
tuples. In its simplest form, tuple partitioning is accomplished in a flat file implementation by 
grouping tuples that are accessed together into horizontal subfile*. The reason such a subfile 
is called a horizontal subfile I* because the table representing the file is being partitioned 
horizontally by its rows. (Similarly, a subfile of an attribute partition may be viewed as a 
vertical subfile.) m tuple partitioning, in order tfntt the access cost of two similarly accessed 
tuples be reduced, the tuples have to be placed in the same page. Tuple partitioning may 
thus be viewed as shuffling the tuples among the page* of theflle, placing similarly accessed 
tuples in the same page, so that total access cost is minimized. We feel that the attribute 
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partitioning heurittte* d»*«*Bp«d to tftts report w« be «wipu*atte«aWy tnfeasible if ever 
applied to the tuple partitioning problem because of ihe sheer number of tuples involved. In 
practical databases, the number of tuples In the database is frequently as targe as 10* , For 
the -same .reason, it appears rather unukety Hurt a s te pwise H i lahwfcMUton - ■ heuristic that 
considers each individual tuple can be de ve l op e d and successfutty applied to the tuple 
partitioning problem Another source of difTlcuxy m the tope* pa rtition i ng problem is that a 
subfile in a tuple partition is limited »t capacity (U. linKUd ea a page). The heuristics we 
have developed db wot take such cceMtraaato |a* *o o msJo^srtPa ■ 

Knuth rjsga and Rive* &g} each describe a heuristic fer clustering tuples according 
to their access rrequerieie*_I|iiiM 

is relocated to the top of the file such Chat the tupk m&i become the Am topic to be searched 
by the next query. In the heuristic .of Rlvest^.a |nple$»»* %«cceasa* Is exchanged wttK the 
immediately preceding tuple. Both heuristics are rather IJiaAed to scc^ siiwe they assume no 
blocking of tuples into pages and they assume that ..du^oo^.uii^^ai^-^ hvaeatfehed. is by 
sequential search. The heuristics also have the drawback that they are very sensitive to the 
order the queries are made to the datahapvaM tb£|^^ partttioiwd 

database can drastically change if the queries arrive at a shf^tdy duTereut order. 

We believe that the tuple p*rtit4otMjM^ *K*ng chaster 

analysis techniques that consider both the attribute values in the tuple and .^?occufHBnc«,;of 
attribute values in queries. Statistics gathered jfer the purpose of tuple partitioning must 
record not only the attributes requested by the queries made to the database, but also the 
attribute values in the equality condition predkates of the query, so that similarly accessed 
tuples may be identified. 
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A slight generalization of the tuple partitioning problem is to the case where tuples 
are stored in pages that need not be completely filled, but may be partly empty. This 
variation of tuple partitioning is even more difficult to solve, but may result in enhanced 
access performance at the cost of an increase in storage requirement Tuple partitioning may 
be further generalized into tuple clustering, where tuples may be redundantly stored in 
several pages, in order to reduce page accesses (while Increasing storage requirements). 

A further generalization of the partitioning problem it the hybrid clustering 
(partitioning^ problem where attribute cfcwtermg (partitioning) and tuple clustering 
(partitioning) are carried out simultaneous^ In tbh problem, a file is partitioned both by its 
attributes and by its tuples such that each subfile has a subset of the attributes and a subset 
of the tuples The subfiles of a hybrid partition need not all be of the same size either in the 
number of attributes or in the number of tuples. One way to picture a flat file partitioned in 
a hybrid manner is as a composition of rectangular mosaics, with varying lengths and widths, 
placed adjacent to one another such that the whole file U covered. The hybrid data clustering 
(partitioning) problem is much larger than either the attribute clustering or the tuple 
clustering problems. For this reason, tu solution requires more powerful heuristics than any 
we have considered. A computationally feasible, yet not necessarily optimal or near optimal, 
approach would be to perform attribute clustering (partitioning) and tuple clustering 
(partitioning) alternately, in order to reach a hybrid cluster that has a locally minimum 
performance cost. 

Another direction of extending attribute partitioning is to consider attribute 
partitioning simultaneously with selecting other file access structures, where the choice of the 
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access structure depends ma *be choice ft «he attrtba te ftattw, and -*tm **m. Wm example, 
we may consider choosing indices tor the attributes of a partitioBed Ute. Its our current 
model, we have assumed *he set of attributes that are taueued to In pr e deter mined, and 
consequently the flte * fiuUmi t a " bated on a fnted *st of notices. » £bc restrtctton of fixed 
indices is removed, the ©verafl per f orm an ce of the database flMnagement system might 
improve when attribute partttfcmmg and mde« jftaflMju a» fiancMumd suup l tan eously. One 
plausible strategy here it to alternately perform atfrtawje |HiMIMw<at larl wadhex selection In a 
stepwise minimization fctruon. Tbf.feaMrtaKjj 
of indices, is particularly usefuUor *uch a strategy. 
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